Abstract

Given samples from two distributions over an $n$-element set, we wish to test whether these
distributions are statistically close. We present an algorithm which uses sublinear in $n$, specifically,
$O(n^{2/3} \epsilon^{-8/3} \log n)$, independent samples from each distribution, runs in time linear in the sample
size, makes no assumptions about the structure of the distributions, and distinguishes the cases
when the distance between the distributions is small (less than $\max\{\epsilon^{4/3} n^{-1/3}/32,\ \epsilon n^{-1/2}/4\}$)
or large (more than $\epsilon$) in $\ell_1$ distance. This result can be compared to the lower bound of
$\Omega(n^{2/3} \epsilon^{-2/3})$ for this problem given by Valiant [54].

Our algorithm has applications to the problem of testing whether a given Markov process is 
rapidly mixing. We present sublinear algorithms for several variants of this problem as well. 
1 Introduction



Suppose we have two distributions over the same $n$-element set, such that we know nothing about
their structure and the only access we have to these distributions is the ability to take independent
samples from them. Suppose further that we want to know whether these two distributions are
close to each other in $\ell_1$ norm.$^1$ A first approach, which we refer to as the naive approach, would
be to sample enough elements from each distribution so that we can approximate the distribution
and then compare the approximations. It is easy to see (see Theorem 23 in Section 3.2) that this
naive approach requires the number of samples to be at least linear in $n$.

In this paper, we develop a method of testing that the distance between two distributions
is at most $\epsilon$ using considerably fewer samples. If the distributions have $\ell_1$ distance at most
$\max\{\epsilon^{4/3} n^{-1/3}/32,\ \epsilon n^{-1/2}/4\}$, then the algorithm will accept with probability at least $1 - \delta$. If
the distributions have $\ell_1$ distance more than $\epsilon$, then the algorithm will accept with probability
at most $\delta$. The number of samples used is $O(n^{2/3} \epsilon^{-8/3} \log(n/\delta))$. In contrast, the methods of
Valiant [53], fixing the incomplete arguments in the original conference paper (see Section 3), yield
an $\Omega(n^{2/3} \epsilon^{-2/3})$ lower bound for testing $\ell_1$ distance in this model.

Our test relies on a test for whether two distributions have small $\ell_2$ distance, which is considerably
easier to test: we give an algorithm with sample complexity independent of $n$. However, small
$\ell_2$ distance does not in general give a good measure of the closeness of two distributions according
to $\ell_1$ distance. For example, two distributions can have disjoint support and still have $\ell_2$ distance
of $O(1/\sqrt{n})$. Still, we can get a very good estimate of the $\ell_2$ distance, say to within $O(1/\sqrt{n})$
additive error, and then use the fact that the $\ell_1$ distance is at most $\sqrt{n}$ times the $\ell_2$ distance.
Unfortunately, the number of queries required by this approach is too large in general. Because of
this, our $\ell_1$ test is forced to distinguish between two cases.
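The gap between the two norms can be seen on a small example. The sketch below is illustrative only: the helper names and the choice of two uniform distributions on disjoint halves of the domain are assumptions, not taken from the paper. It shows a pair at the maximal $\ell_1$ distance of 2 whose $\ell_2$ distance is only $2/\sqrt{n}$.

```python
import math


def l1_distance(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def l2_distance(p, q):
    """Euclidean distance between the probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def disjoint_uniform_pair(n):
    """Hypothetical example: p uniform on the first n/2 elements,
    q uniform on the last n/2 elements (n assumed even)."""
    half = n // 2
    p = [1.0 / half] * half + [0.0] * half
    q = [0.0] * half + [1.0 / half] * half
    return p, q


n = 10_000
p, q = disjoint_uniform_pair(n)
d1 = l1_distance(p, q)  # disjoint supports give the maximal value 2
d2 = l2_distance(p, q)  # each coordinate differs by 2/n, so this is 2/sqrt(n)
```

Multiplying `d2` by $\sqrt{n}$ recovers `d1` here, which is the worst case of the inequality $\|v\|_1 \le \sqrt{n}\,\|v\|_2$ used in the text.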

For distributions with small $\ell_2$ norm, we show how to use the $\ell_2$ distance to get an efficient test
for $\ell_1$ distance. For distributions with larger $\ell_2$ norm, we use the fact that such distributions must
have elements which occur with relatively high probability. We create a filtering test that partitions
the domain into those elements with relatively high probability and all the other elements (those
with relatively low probability). The test estimates the $\ell_1$ distance due to these high-probability
elements directly, using the naive approach mentioned above. The test then approximates the
$\ell_1$ distance due to the low-probability elements using the test for $\ell_2$ distance. Optimizing the
notion of "high probability" yields our $O(n^{2/3} \epsilon^{-8/3} \log(n/\delta))$ algorithm. The $\ell_2$ distance test uses
$O(\epsilon^{-4} \log(1/\delta))$ samples.

Applying our techniques to Markov chains, we use the above algorithm as a basis for constructing
tests for determining whether a Markov chain is rapidly mixing. We show how to test whether
iterating a Markov chain for $t$ steps causes it to reach a distribution close to the stationary
distribution. Our testing algorithm works by following $O(t n^{5/3})$ edges in the chain. When the Markov
chain is dense enough and represented in a convenient way (such a representation can be computed
in linear time, and we give an example representation later in the paper), this test remains sublinear in
the size of the Markov chain for small $t$. We then investigate two notions of being close to a rapidly
mixing Markov chain that fall within the framework of property testing, and show how to test that
a given Markov chain is close to a Markov chain that mixes in $t$ steps by following only $O(t n^{2/3})$
edges. In the case of Markov chains that come from directed graphs and pass our test, our theorems
show the existence of a directed graph that is both close to the original one and rapidly mixing.

$^1$ Half of the $\ell_1$ distance between two distributions is also referred to as the total variation distance.
1.1 Related Work



Testing Properties of Distributions The use of collision statistics in a sample has been proposed
as a technique to test whether a distribution is uniform (see, for example, Knuth [40]).
Goldreich and Ron [32] give the first formal analysis showing that using $O(\sqrt{n})$ samples to estimate the
collision probability yields an algorithm which gives a very good estimate of the $\ell_2$ distance between
the given distribution and the uniform distribution. Their "collision count" idea underlies
the present paper. More recently, Paninski [45] presents a test to determine whether a distribution
is far from the uniform distribution with respect to $\ell_1$ distance using $\Theta(\sqrt{n}/\epsilon^2)$ samples. Ma [12]
also uses collisions to measure the entropy of a distribution defined by particle trajectories. After
the publication of the preliminary version of this paper, a long line of publications appeared
regarding testing properties of distributions including independence, entropy, and monotonicity (see,
for example, [SI HUl HH El [Ml SSI [Ml EH [1] ) . 

Expansion, Rapid Mixing, and Conductance Goldreich and Ron [32] present a test that they 
conjecture can be used to give an algorithm with $O(\sqrt{n})$ query complexity which tests whether a
regular graph is close to being an expander, where by close they mean that by changing a small 
fraction of the edges they can turn it into an expander. Their test is based on picking a random node 
and testing whether random walks from this node reach a distribution that is close to the uniform 
distribution on the nodes. Our tests for Markov chains are based on similar principles. Mixing and 
expansion are known to be related [53], but our techniques only apply to the mixing properties
of random walks on directed graphs, since the notion of closeness we use does not preserve the 
symmetry of the adjacency matrix. More recently, a series of papers |22[ [371 H3] answer Goldreich 
and Ron's conjecture in the affirmative. In a previous work, Goldreich and Ron [31] show that 
testing that a graph is close to an expander requires $\Omega(n^{1/2})$ queries.

The conductance [53] of a graph is known to be closely related to expansion and rapid-mixing
properties of the graph [38, 53]. Frieze and Kannan [27] show that, given a graph $G$ with $n$ vertices
and $\alpha$, one can approximate the conductance of $G$ to within additive error $\alpha$ in time $n \cdot 2^{\tilde{O}(1/\alpha^2)}$.
Their techniques also yield a $2^{\mathrm{poly}(1/\epsilon)}$-time test that determines whether the adjacency matrix of
a graph can be changed in at most an $\epsilon$ fraction of the locations to get a graph with high conductance.
However, for the purpose of testing whether an $n$-vertex, $m$-edge graph is rapidly mixing, we would
need to approximate its conductance to within $\alpha = O(m/n^2)$; thus, only when $m = \Theta(n^2)$ would
the algorithm in [27] run in $O(n)$ time.

We now discuss some other known results for testing of rapid mixing through eigenvalue
computations. It is known that mixing [53, 38] is related to the separation between the two largest
eigenvalues [4]. Standard techniques for approximating the eigenvalues of a dense $n \times n$ matrix run
in $\Theta(n^3)$ floating-point operations and consume $O(n^2)$ words of memory [33]. However, for a sparse
$n \times n$ symmetric matrix with $m$ nonzero entries, $n < m$, "Lanczos algorithms" [46] accomplish the
same task in $O(n(m + \log n))$ floating-point operations, consuming $\Theta(n + m)$ storage. Furthermore,
it is found in practice that these algorithms can be run for far fewer, even a constant number, of
iterations while still obtaining highly accurate values for the outer and inner few eigenvalues.

Streaming There is much work on the problem of estimating the distance between distributions
in data streaming models where space rather than time is limited (cf., |28[ [31 [2l"l I26j). Another 
line of work [15] estimates the distance in frequency count distributions on words between various 
documents, where again space is limited. Guha et al. [35] have extended our result to estimating 
the closeness of distributions with respect to a range of $f$-divergences, which include $\ell_1$ distance.
Testing distributions in streaming data models has been an active area of research in recent
years (see, for example, [ID [13 EH EH [13 (IB1 [13 [H] ) ■ 




Other Related Models In an interactive setting, Sahai and Vadhan [52] show that, given
distributions $p$ and $q$ generated by polynomial-size circuits, the problem of distinguishing whether
$p$ and $q$ are close or far in $\ell_1$ norm is complete for statistical zero knowledge. Kannan and Yao [39]
outline a program checking framework for certifying the randomness of a program's output. In
their model, one does not assume that samples from the input distribution are independent.

Computational Learning Theory There is a vast literature on testing statistical hypotheses. 
In these works, one is given examples chosen from the same distribution out of two possible choices, 
say p and q. The goal is to decide which of two distributions the examples are coming from. More 
generally, the goal can be stated as deciding which of two known classes of distributions contains 
the distribution generating the examples. This can be seen to be a generalization of our model 
as follows: Let the first class of distributions be the set of distributions of the form $q \times q$. Let
the second class of distributions be the set of distributions of the form $q_1 \times q_2$ where the $\ell_1$
difference of $q_1$ and $q_2$ is at least $\epsilon$. Then, given examples from two distributions $p_1, p_2$, create a
set of example pairs $(x, y)$ where $x$ is chosen according to $p_1$ and $y$ according to $p_2$ independently.
Bounds and an optimal algorithm for the general problem for various distance measures are given
in [19, 44, 20, 21, 41]. None of these give sublinear bounds in the domain size for our problem. The
specific model of singleton hypothesis classes is studied by Yamanishi [56]. 

1.2 Notation 

We use the following notation. We denote the set $\{1, \ldots, n\}$ with $[n]$. The notation $x \in_R [n]$
denotes that $x$ is chosen uniformly at random from the set $[n]$. The $\ell_1$ norm of a vector $v$ is
denoted by $\|v\|_1$ and is equal to $\sum_{i=1}^{n} |v_i|$. Similarly, the $\ell_2$ norm is denoted by $\|v\|_2$ and is equal
to $\sqrt{\sum_{i=1}^{n} v_i^2}$, and $\|v\|_\infty = \max_i |v_i|$. We assume our distributions are discrete distributions over
$n$ elements, with labels in $[n]$, and will represent such a distribution as a vector $p = (p_1, \ldots, p_n)$,
where $p_i$ is the probability of outputting element $i$.

The collision probability of two distributions $p$ and $q$ is the probability that a sample from each
of $p$ and $q$ yields the same element. Note that, for two distributions $p, q$, the collision probability
is $p \cdot q = \sum_i p_i q_i$. To avoid ambiguity, we refer to the collision probability of $p$ and $p$ as the
self-collision probability of $p$. Note that the self-collision probability of $p$ is $\|p\|_2^2$.
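To make these definitions concrete, here is a minimal sketch of the exact collision probability together with the two collision counts that the tests in this paper are built on. The function names are illustrative assumptions; counting through per-element tallies is one possible implementation that avoids enumerating all pairs.

```python
from collections import Counter


def collision_probability(p, q):
    """Exact collision probability p . q = sum_i p_i q_i."""
    return sum(a * b for a, b in zip(p, q))


def count_self_collisions(samples):
    """Number of index pairs i < j with samples[i] == samples[j].

    An element seen c times contributes c * (c - 1) / 2 such pairs.
    """
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())


def count_cross_collisions(samples_p, samples_q):
    """Number of index pairs (i, j) with samples_p[i] == samples_q[j]."""
    counts_q = Counter(samples_q)
    return sum(c * counts_q[x] for x, c in Counter(samples_p).items())
```

For instance, the sample `[1, 1, 1, 2]` contains three self-collisions, all among the three copies of element 1.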

2 Testing Closeness of Distributions 

The main goal of this section is to show how to test whether two distributions $p$ and $q$ are close
in $\ell_1$ norm in sublinear time in the size of the domain of the distributions. We are given access to
these distributions via black boxes which upon a query respond with an element of $[n]$ generated
according to the respective distribution. Our main theorem is:

Theorem 1 Given parameters $\delta$ and $\epsilon$, and distributions $p, q$ over a set of $n$ elements, there is a
test which runs in time $O(n^{2/3} \epsilon^{-8/3} \log(n/\delta))$ such that, if $\|p - q\|_1 \le \max\left(\frac{\epsilon^{4/3}}{32\sqrt[3]{n}}, \frac{\epsilon}{4\sqrt{n}}\right)$, then the
test accepts with probability at least $1 - \delta$ and, if $\|p - q\|_1 > \epsilon$, then the test rejects with probability
at least $1 - \delta$.



In order to prove this theorem, we give a test which determines whether $p$ and $q$ are close in
$\ell_2$ norm. The test is based on estimating the self-collision and collision probabilities of $p$ and $q$.
In particular, if $p$ and $q$ are close, one would expect that the self-collision probabilities of each are
close to the collision probability of the pair. Formalizing this intuition, in Section 2.1 we prove:

Theorem 2 Given parameters $\delta$ and $\epsilon$, and distributions $p$ and $q$ over a set of $n$ elements, there
exists a test such that, if $\|p - q\|_2 \le \epsilon/2$, then the test accepts with probability at least $1 - \delta$ and,
if $\|p - q\|_2 > \epsilon$, then the test rejects with probability at least $1 - \delta$. The running time of the test is
$O(\epsilon^{-4} \log(1/\delta))$.

The test used to prove Theorem 2 is given below in Figure 1. The number of pairwise self-collisions
in multiset $F \subseteq [n]$ is the count of $i < j$ such that the $i$th sample in $F$ is the same as the $j$th sample
in $F$. Similarly, the number of collisions between $Q_p \subseteq [n]$ and $Q_q \subseteq [n]$ is the count of $(i, j)$ such
that the $i$th sample in $Q_p$ is the same as the $j$th sample in $Q_q$.

$\ell_2$-Distance-Test($p$, $q$, $m$, $\epsilon$, $\delta$)

Repeat $O(\log(1/\delta))$ times

1. Let $F_p$ and $F_q$ be multisets of $m$ samples from $p$ and $q$, respectively. Let $r_p$ and $r_q$
be the numbers of pairwise self-collisions in $F_p$ and $F_q$, respectively.

2. Let $Q_p$ and $Q_q$ be multisets of $m$ samples from $p$ and $q$, respectively. Let $s_{pq}$ be the
number of collisions between $Q_p$ and $Q_q$.

3. Let $r = \frac{2m}{m-1}(r_p + r_q)$. Let $s = 2 s_{pq}$.

4. If $r - s > 3m^2\epsilon^2/4$, then reject the current iteration.

Reject if the majority of iterations reject, accept otherwise.

Figure 1: Algorithm $\ell_2$-Distance-Test

We use the parameter $m$ to indicate the number of samples needed by the test to get constant
confidence. In order to bound the $\ell_2$ distance between $p$ and $q$ by $\epsilon$, setting $m = O(1/\epsilon^4)$ suffices. By
maintaining arrays which count the number of times that each element is sampled (for example, $N_p(i)$
for $F_p$), one can achieve the claimed running time bounds for computing an estimate of the collision
probability. In this way, essentially $m^2$ estimations of the collision probability can be performed
in $O(m)$ time.

Since $\|v\|_1 \le \sqrt{n} \cdot \|v\|_2$, a simple way to extend the above test to an $\ell_1$ distance test is by
setting $\epsilon' = \epsilon/\sqrt{n}$. This would give the correct output behavior for the tester. Unfortunately, due
to the order of the dependence on $\epsilon$ in the $\ell_2$ distance test, the resulting running time is quadratic
in $n$. It is possible, though, to achieve sublinear running times if the input distributions are known
to be reasonably evenly distributed. We make this precise by a closer analysis of the variance of
the estimator in the test in Lemma 5. In particular, we analyze the dependence of the variances
of $s$ and $r$ on the parameter $b = \max(\|p\|_\infty, \|q\|_\infty)$. There we show that given $p$ and $q$ such that
$b = O(n^{-\alpha})$, one can call $\ell_2$-Distance-Test with an error parameter of $\epsilon/\sqrt{n}$ and achieve a running
time of $O(\epsilon^{-4}(n^{1-\alpha/2} + n^{2-2\alpha}))$. Thus, when the maximum probability of any element is bounded,
the $\ell_2$ distance test can in fact yield a sublinear-time algorithm for testing closeness in $\ell_1$ distance.
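The procedure of Figure 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: distributions are modeled as callables returning `k` independent samples, the scaling `r = 2m/(m-1) * (r_p + r_q)` is the one that makes $E[r - s] = m^2\|p - q\|_2^2$, the number of iterations is passed explicitly instead of being derived from $\delta$, and `make_sampler` is a hypothetical helper. Per-element counts play the role of the counting arrays mentioned in the text, so each iteration runs in time linear in `m`.

```python
import random
from collections import Counter


def make_sampler(weights, rng=None):
    """Hypothetical helper: black-box access to a distribution on {0, ..., n-1}."""
    rng = rng or random.Random()
    population = range(len(weights))
    return lambda k: rng.choices(population, weights=weights, k=k)


def self_collisions(samples):
    """Pairs i < j with equal samples, computed from per-element counts."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())


def cross_collisions(xs, ys):
    """Pairs (i, j) with xs[i] == ys[j], computed from per-element counts."""
    count_y = Counter(ys)
    return sum(c * count_y[x] for x, c in Counter(xs).items())


def l2_distance_test(sample_p, sample_q, m, eps, iterations):
    """Return True (accept) unless a strict majority of iterations reject."""
    rejects = 0
    for _ in range(iterations):
        # Step 1: self-collisions within fresh samples F_p and F_q.
        r_p = self_collisions(sample_p(m))
        r_q = self_collisions(sample_q(m))
        # Step 2: collisions between fresh samples Q_p and Q_q.
        s_pq = cross_collisions(sample_p(m), sample_q(m))
        # Step 3: rescale so that E[r - s] = m^2 * ||p - q||_2^2.
        r = 2 * m / (m - 1) * (r_p + r_q)
        s = 2 * s_pq
        # Step 4: compare against the threshold 3 m^2 eps^2 / 4.
        if r - s > 3 * m * m * eps * eps / 4:
            rejects += 1
    return 2 * rejects <= iterations
```

A call such as `l2_distance_test(make_sampler(w1), make_sampler(w2), m=1000, eps=0.1, iterations=5)` accepts two identical uniform distributions and rejects two uniform distributions on disjoint halves of a 100-element domain, whose $\ell_2$ distance is 0.2.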
In the previous paragraph, we have noted that, for distributions with a bound on the maximum
probability of any element, it is possible to test closeness with time and queries sublinear in the
domain size. On the other hand, when the minimum probability element is quite large, the naive
approach that we referred to in the introduction can be significantly more efficient. This suggests
a filtering algorithm, which separates the domain of the distributions being tested into two parts:
the big elements, or those elements to which the distributions assign relatively high probability
weight, and the small elements, which are all other elements. Then, the naive tester is applied to
the distributions restricted to the big elements, and the tester that is based on estimating the $\ell_2$
distance is applied to the distributions restricted to the small elements.

More specifically, we use the following definition to identify the elements with large weights. 

Definition 3 (Big element) An element $i$ is called big with respect to a distribution $p$ if $p_i > (\epsilon/n)^{2/3}$.

The complete test is given below in Figure 2. The proof of Theorem 1 is presented in Section 2.2.

$\ell_1$-Distance-Test($p$, $q$, $\epsilon$, $\delta$)

1. Let $b = (\epsilon/n)^{2/3}$.

2. Sample $p$ and $q$ for $M = O(\epsilon^{-8/3} n^{2/3} \log(n/\delta))$ times.

3. Let $S_p$ and $S_q$ be the sample sets obtained from $p$ and $q$, respectively, by discarding
elements that occur less than $(1 - \epsilon/26)Mb$ times.

4. If $S_p$ and $S_q$ are empty,

$\ell_2$-Distance-Test($p$, $q$, $O(n^{2/3}/\epsilon^{8/3})$, $\epsilon/(2\sqrt{n})$, $\delta/2$)

else

i. Let $\ell_i^p$ (resp., $\ell_i^q$) be the number of times element $i$ appears in $S_p$ (resp., $S_q$).

ii. Reject if $\sum_{i \in S_p \cup S_q} |\ell_i^p - \ell_i^q| > \epsilon M/8$.

iii. Define $p'$ as follows: Sample an element from $p$. If this sample is not in $S_p \cup S_q$,
output it; otherwise, output an $x \in_R [n]$. Define $q'$ similarly.

iv. $\ell_2$-Distance-Test($p'$, $q'$, $O(n^{2/3}/\epsilon^{8/3})$, $\epsilon/(2\sqrt{n})$, $\delta/2$)

Figure 2: Algorithm $\ell_1$-Distance-Test
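The two phases of Figure 2 can be sketched as follows. This is an illustrative sketch under several assumptions rather than the paper's implementation: the sample sizes `M` and `m` and the number of iterations are passed in explicitly because the constants hidden in the $O(\cdot)$ bounds are not specified here, the domain is labeled `0..n-1` instead of $[n]$, the closeness parameter $\epsilon/(2\sqrt{n})$ for the second phase is taken from Section 2.2, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def make_sampler(weights, rng):
    """Hypothetical helper: black-box access to a distribution on {0, ..., n-1}."""
    population = range(len(weights))
    return lambda k: rng.choices(population, weights=weights, k=k)


def self_collisions(samples):
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())


def cross_collisions(xs, ys):
    count_y = Counter(ys)
    return sum(c * count_y[x] for x, c in Counter(xs).items())


def l2_distance_test(sample_p, sample_q, m, eps, iterations):
    """Collision-based l2 test of Figure 1 (majority vote over iterations)."""
    rejects = 0
    for _ in range(iterations):
        r = 2 * m / (m - 1) * (self_collisions(sample_p(m))
                               + self_collisions(sample_q(m)))
        s = 2 * cross_collisions(sample_p(m), sample_q(m))
        if r - s > 3 * m * m * eps * eps / 4:
            rejects += 1
    return 2 * rejects <= iterations


def l1_distance_test(sample_p, sample_q, n, eps, M, m, iterations, rng):
    """Return True to accept, False to reject."""
    b = (eps / n) ** (2 / 3)                        # step 1: bigness threshold
    count_p = Counter(sample_p(M))                  # step 2: M samples from each
    count_q = Counter(sample_q(M))
    cutoff = (1 - eps / 26) * M * b                 # step 3: keep frequent elements
    big = {i for counts in (count_p, count_q)
           for i, c in counts.items() if c >= cutoff}
    if big:
        # Steps 4(i)-(ii): naive l1 estimate restricted to the big elements.
        naive = sum(abs(count_p[i] - count_q[i]) for i in big)
        if naive > eps * M / 8:
            return False

    def filtered(sampler):
        # Step 4(iii): replace every big element by a uniform random element.
        return lambda k: [x if x not in big else rng.randrange(n)
                          for x in sampler(k)]

    # Step 4(iv), or the direct call when no element is big.
    return l2_distance_test(filtered(sample_p), filtered(sample_q),
                            m, eps / (2 * math.sqrt(n)), iterations)
```

With `n = 100`, `eps = 0.5`, and `M = m = 20000`, this sketch accepts a pair of identical uniform distributions (no element is big, so only the $\ell_2$ phase runs) and rejects a uniform distribution against one that places half of its mass on a single element (the naive phase already detects the difference).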



2.1 Closeness in $\ell_2$ Norm

In this section, we analyze Algorithm $\ell_2$-Distance-Test and prove Theorem 2. The statistics $r_p$,
$r_q$ and $s$ in Algorithm $\ell_2$-Distance-Test are estimators for the self-collision probability of $p$, of
$q$, and of the collision probability between $p$ and $q$, respectively. If $p$ and $q$ are statistically close,
we expect that the self-collision probabilities of each are close to the collision probability of the
pair. These probabilities are exactly the inner products of these vectors. In particular, if the set
$F_p$ of samples from $p$ is given by $\{F_p^1, \ldots, F_p^m\}$, then, for any pair $i, j \in [m]$, $i \ne j$, we have that
$\Pr[F_p^i = F_p^j] = p \cdot p = \|p\|_2^2$. By combining these statistics, we show that $r - s$ is an estimator for
the desired value $\|p - q\|_2^2$.

In order to analyze the number of samples required to estimate $r - s$ to a high enough accuracy,
we must also bound the variance of the variables $s$ and $r$ used in the test. One distinction to
make between self-collisions and collisions between $p$ and $q$ is that, for the self-collisions, we only
consider samples for which $i \ne j$, but this is not necessary for the collisions between $p$ and $q$. We
accommodate this in our algorithm by scaling $r_p$ and $r_q$ appropriately. By this scaling and from the
above discussion we see that $E[s] = 2m^2 (p \cdot q)$ and that
$E[r - s] = m^2(\|p\|_2^2 + \|q\|_2^2 - 2(p \cdot q)) = m^2 \|p - q\|_2^2$.
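As a sanity check on these expectations, the scaling can be worked out explicitly. The expansion below assumes the definitions $r = \frac{2m}{m-1}(r_p + r_q)$ and $s = 2 s_{pq}$ from step 3 of the $\ell_2$-Distance-Test, together with $E[r_p] = \binom{m}{2}\|p\|_2^2$ (each of the $\binom{m}{2}$ pairs collides with probability $\|p\|_2^2$) and $E[s_{pq}] = m^2 (p \cdot q)$ (each of the $m^2$ pairs collides with probability $p \cdot q$).

```latex
\begin{align*}
E[r] &= \frac{2m}{m-1}\bigl(E[r_p] + E[r_q]\bigr)
      = \frac{2m}{m-1}\binom{m}{2}\bigl(\|p\|_2^2 + \|q\|_2^2\bigr)
      = m^2\bigl(\|p\|_2^2 + \|q\|_2^2\bigr), \\
E[s] &= 2\,E[s_{pq}] = 2m^2\,(p \cdot q), \\
E[r - s] &= m^2\bigl(\|p\|_2^2 + \|q\|_2^2 - 2\,(p \cdot q)\bigr)
          = m^2\,\|p - q\|_2^2 .
\end{align*}
```

The factor $\frac{2m}{m-1}$ is exactly what converts the $\binom{m}{2}$ unordered pairs counted by a self-collision statistic into the $m^2$ pairs counted by the cross-collision statistic.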

A complication which arises from this scheme is that the pairwise samples are not independent. 
We use Chebyshev's inequality (see Appendix A) to bound the quality of the approximation, which
in turn requires that we give a bound on the variance, as we do in this section. 

Our techniques extend the work of Goldreich and Ron [32], where self-collision probabilities are
used to estimate the $\ell_2$ norm of a vector, and in turn the deviation of a distribution from uniform. In
particular, their work provides an analysis of the statistics $r_p$ and $r_q$ above through the following
lemma.

Lemma 4 ([32]) Consider the random variable $r_p$ in Algorithm $\ell_2$-Distance-Test. Then, $E[r_p] =
\binom{m}{2} \cdot \|p\|_2^2$ and $\mathrm{Var}(r_p) \le 2(E[r_p])^{3/2}$.

We next present a tighter variance bound given in terms of the largest weight in p and q. 



Lemma 5 There is a constant $c$ such that

$\mathrm{Var}(r_p) \le m^2 \|p\|_2^2 + m^3 \|p\|_3^3 \le c(m^3 b^2 + m^2 b)$,
$\mathrm{Var}(r_q) \le m^2 \|q\|_2^2 + m^3 \|q\|_3^3 \le c(m^3 b^2 + m^2 b)$, and
$\mathrm{Var}(s) \le c(m^3 b^2 + m^2 b)$,

where $b = \max(\|p\|_\infty, \|q\|_\infty)$.



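Proof: Let $F$ be the set $\{1, \ldots, m\}$. For $(i, j) \in F \times F$, define the indicator variable $C_{i,j} = 1$
if the $i$th element of $Q_p$ and the $j$th element of $Q_q$ are the same. Then, the variable from the
algorithm is $s_{pq} = \sum_{i,j} C_{i,j}$. Also define the notation $\bar{C}_{i,j} = C_{i,j} - E[C_{i,j}]$. Given these definitions,
we can write

\[
\begin{aligned}
\mathrm{Var}\Bigl(\sum_{(i,j) \in F \times F} C_{i,j}\Bigr)
&= E\Bigl[\Bigl(\sum_{(i,j) \in F \times F} \bar{C}_{i,j}\Bigr)^2\Bigr] \\
&= E\Bigl[\sum_{(i,j) \in F \times F} (\bar{C}_{i,j})^2 + 2 \sum_{(i,j) \ne (k,l) \in F \times F} \bar{C}_{i,j}\bar{C}_{k,l}\Bigr] \\
&\le E\Bigl[\sum_{(i,j) \in F \times F} C_{i,j}\Bigr] + 2 \cdot E\Bigl[\sum_{(i,j) \ne (k,l) \in F \times F} \bar{C}_{i,j}\bar{C}_{k,l}\Bigr] \\
&= m^2 (p \cdot q) + 2 \cdot E\Bigl[\sum_{(i,j) \ne (k,l) \in F \times F} \bar{C}_{i,j}\bar{C}_{k,l}\Bigr].
\end{aligned}
\]

To analyze the last expectation, we use two facts. First, it is easy to see, by the definition of
covariance, that $E[\bar{C}_{i,j}\bar{C}_{k,l}] \le E[C_{i,j}C_{k,l}]$. Secondly, we note that $C_{i,j}$ and $C_{k,l}$ are not independent
only when $i = k$ or $j = l$. Expanding the sum, we get

\[
\begin{aligned}
E\Bigl[\sum_{(i,j) \ne (k,l) \in F \times F} \bar{C}_{i,j}\bar{C}_{k,l}\Bigr]
&= E\Bigl[\sum_{(i,j),(i,l) \in F \times F,\ j \ne l} \bar{C}_{i,j}\bar{C}_{i,l}
   + \sum_{(i,j),(k,j) \in F \times F,\ i \ne k} \bar{C}_{i,j}\bar{C}_{k,j}\Bigr] \\
&\le E\Bigl[\sum_{(i,j),(i,l) \in F \times F,\ j \ne l} C_{i,j}C_{i,l}
   + \sum_{(i,j),(k,j) \in F \times F,\ i \ne k} C_{i,j}C_{k,j}\Bigr] \\
&\le c\,m^3 \sum_{\ell \in [n]} \bigl(p_\ell q_\ell^2 + p_\ell^2 q_\ell\bigr)
 \le 2c\,m^3 b^2 \sum_{\ell \in [n]} q_\ell \\
&= 2c\,m^3 b^2
\end{aligned}
\]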

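for some constant $c$. Next, we bound $\mathrm{Var}(r)$ similarly to $\mathrm{Var}(s)$ using the argument in the proof
of Lemma 4 from [32]. Consider an analogous calculation to the preceding inequality for $\mathrm{Var}(r_p)$
(similarly, for $\mathrm{Var}(r_q)$) where $X_{ij} = 1$ for $1 \le i < j \le m$ if the $i$th and $j$th samples in $F_p$ are the
same. Similarly to above, define $\bar{X}_{ij} = X_{ij} - E[X_{ij}]$. Then, we get

\[
\begin{aligned}
\mathrm{Var}(r_p)
&= E\Bigl[\Bigl(\sum_{1 \le i < j \le m} \bar{X}_{ij}\Bigr)^2\Bigr] \\
&\le \sum_{1 \le i < j \le m} E[X_{ij}] + 4 \cdot \sum_{1 \le i < j < k \le m} E[X_{ij}X_{jk}] \\
&\le \binom{m}{2} \cdot \sum_{t \in [n]} p_t^2 + 4 \cdot \binom{m}{3} \sum_{t \in [n]} p_t^3 \\
&\le O(m^2) \cdot b + O(m^3) \cdot b^2 .
\end{aligned}
\]

Thus, we get the upper bound for both variances. □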

Corollary 6 There is a constant $c$ such that $\mathrm{Var}(r - s) \le c(m^3 b^2 + m^2 b)$, where $b = \max(\|p\|_\infty, \|q\|_\infty)$.

Proof: Since variance is additive for independent random variables, we get $\mathrm{Var}(r - s) \le c(m^3 b^2 +
m^2 b)$. □

Now using Chebyshev's inequality, it follows that if we choose $m = O(\epsilon^{-4})$, we can achieve an
error probability less than 1/3. It follows from standard techniques that with $O(\log(1/\delta))$ iterations
we can achieve an error probability at most $\delta$.

Finally, we can analyze the behavior of the algorithm. 

Theorem 7 For two distributions $p$ and $q$ such that $b = \max(\|p\|_\infty, \|q\|_\infty)$ and $m = \Omega((b^2 +
\epsilon^2\sqrt{b})/\epsilon^4)$, if $\|p - q\|_2 \le \epsilon/2$, then $\ell_2$-Distance-Test($p$, $q$, $m$, $\epsilon$, $\delta$) accepts with probability at least
$1 - \delta$. If $\|p - q\|_2 > \epsilon$, then $\ell_2$-Distance-Test($p$, $q$, $m$, $\epsilon$, $\delta$) accepts with probability less than $\delta$.
The running time is $O(m \log(1/\delta))$.

Proof: For our statistic $A = (r - s)$, we can say, using Chebyshev's inequality and Corollary 6,
that for some constant $c$,

\[
\Pr\bigl[|A - E[A]| > \rho\bigr] \le \frac{c(m^3 b^2 + m^2 b)}{\rho^2}.
\]

Recalling that $E[A] = m^2 \|p - q\|_2^2$, we observe that the $\ell_2$-Distance-Test can distinguish
between the cases $\|p - q\|_2 \le \epsilon/2$ and $\|p - q\|_2 > \epsilon$ if $A$ is within $m^2\epsilon^2/4$ of its expectation. We
can bound the error probability by

\[
\Pr\bigl[|A - E[A]| > m^2\epsilon^2/4\bigr] \le \frac{16c(m^3 b^2 + m^2 b)}{m^4 \epsilon^4}.
\]

Thus, for $m = \Omega((b^2 + \epsilon^2\sqrt{b})/\epsilon^4)$, the probability above is bounded by a constant. This error
probability can be reduced to $\delta$ by $O(\log(1/\delta))$ repetitions. □

2.2 Closeness in $\ell_1$ Norm

The $\ell_1$-Distance-Test proceeds in two phases. The first phase of the algorithm (lines 1-3 and
4(i)-(ii)) determines which elements of the domain are the big elements (as defined in Definition 3)
and estimates their contribution to the distance $\|p - q\|_1$. The second phase (lines 4(iii)-(iv))
filters out the big elements and invokes the $\ell_2$-Distance-Test on the filtered distribution with
closeness parameter $\epsilon/(2\sqrt{n})$. The correctness of this subroutine call is given by Theorem 7 with
$b = 2\epsilon^{2/3} n^{-2/3}$. With these substitutions, the number of samples $m$ is $O(\epsilon^{-8/3} n^{2/3})$. The choice
of threshold $b$ in $\ell_1$-Distance-Test for the weight of the big elements arises from optimizing the
running-time trade-off between the two phases of the algorithm.
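The sample bound quoted for the second phase can be checked by substituting into the requirement $m = \Omega((b^2 + \epsilon'^2\sqrt{b})/\epsilon'^4)$ of Theorem 7. The worked substitution below is a sketch that writes $\epsilon'$ for the closeness parameter passed to the $\ell_2$-Distance-Test (a symbol introduced here for readability) and uses $\epsilon \le 2$, which always holds for an $\ell_1$ distance between distributions.

```latex
\begin{align*}
\epsilon' &= \frac{\epsilon}{2\sqrt{n}}, \qquad b = 2\,\epsilon^{2/3} n^{-2/3}, \\
\frac{b^2}{\epsilon'^4}
  &= 4\,\epsilon^{4/3} n^{-4/3} \cdot \frac{16\,n^2}{\epsilon^4}
   = 64\,\epsilon^{-8/3} n^{2/3}, \\
\frac{\epsilon'^2 \sqrt{b}}{\epsilon'^4} = \frac{\sqrt{b}}{\epsilon'^2}
  &= \sqrt{2}\,\epsilon^{1/3} n^{-1/3} \cdot \frac{4\,n}{\epsilon^2}
   = 4\sqrt{2}\,\epsilon^{-5/3} n^{2/3}
   \le 8\sqrt{2}\,\epsilon^{-8/3} n^{2/3}, \\
\frac{b^2 + \epsilon'^2\sqrt{b}}{\epsilon'^4}
  &= O\bigl(\epsilon^{-8/3} n^{2/3}\bigr).
\end{align*}
```

The first term dominates, so the $\ell_2$ phase costs $O(\epsilon^{-8/3} n^{2/3})$ samples per iteration, matching the cost of the first phase up to the logarithmic factor.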

We need to show that by using a sample of size $O(\epsilon^{-8/3} n^{2/3} \log(n/\delta))$, we can estimate the
weights of each of the big elements to within a multiplicative factor of $1 + O(\epsilon)$, with probability
at least $1 - \delta/2$.

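Lemma 8 Let $b = \epsilon^{2/3} n^{-2/3}$. In $\ell_1$-Distance-Test, after taking $M = O\left(\frac{n^{2/3} \log(n/\delta)}{\epsilon^{8/3}}\right)$ samples
from a distribution $p$, we define $\hat{p}_i = \ell_i^p / M$. Then, with probability at least $1 - \delta/2$, the following
hold for all $i$: (1) if $p_i \ge (1 - \epsilon/13)b$, then $|\hat{p}_i - p_i| < \frac{\epsilon}{26} \max(p_i, b)$, (2) if $p_i < (1 - \epsilon/13)b$, then
$\hat{p}_i < (1 - \epsilon/26)b$.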

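Proof: We analyze two cases; we use Chernoff bounds to show that, for each $i$, the following
holds: If $p_i > b$, then

\[
\Pr\bigl[|\hat{p}_i - p_i| > \epsilon p_i/26\bigr] < \exp(-O(\epsilon^2 M p_i)) < \exp(-O(\epsilon^2 M b)) \le \frac{\delta}{2n}.
\]

If $p_i < b$, then

\[
\Pr\bigl[|\hat{p}_i - p_i| > \epsilon b/26\bigr]
\le \Pr\Bigl[|\hat{p}_i - p_i| > \frac{\epsilon b}{26 p_i}\, p_i\Bigr]
< \exp(-O(\epsilon^2 b^2 M / p_i)) < \exp(-O(\epsilon^2 M b)) \le \frac{\delta}{2n}.
\]

The lemma follows by the union bound. □

Now we are ready to prove our main theorem.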
Now we are ready to prove our main theorem. 



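Theorem 9 For $\epsilon \ge 1/\sqrt{n}$, $\ell_1$-Distance-Test accepts distributions $p, q$ such that $\|p - q\|_1 \le
\max\left(\frac{\epsilon^{4/3}}{32\sqrt[3]{n}}, \frac{\epsilon}{4\sqrt{n}}\right)$, and rejects when $\|p - q\|_1 > \epsilon$, with probability at least $1 - \delta$. The running time
of the test is $O(\epsilon^{-8/3} n^{2/3} \log(n/\delta))$.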
Proof: Suppose items (1) and (2) from Lemma 8 hold for all $i$, and for both $p$ and $q$. By Lemma 8,
this event happens with probability at least $1 - \delta/2$.

Let $S = S_p \cup S_q$. By our assumption, all the big elements of both $p$ and $q$ are in $S$, and no
element that has weight less than $(1 - \epsilon/13)b$ in both distributions is in $S$. Let $\Delta_1$ be the $\ell_1$ distance
attributed to the elements in $S$; that is, $\sum_{i \in S} |p_i - q_i|$. Let $\Delta_2 = \|p' - q'\|_1$ (in the case that $S$ is
empty, $\Delta_1 = 0$, $p = p'$ and $q = q'$). Note that $\Delta_1 \le \|p - q\|_1$. We can show that $\Delta_2 \le \|p - q\|_1$,
and $\|p - q\|_1 \le 2\Delta_1 + \Delta_2$.

Next, we show that the algorithm estimates $\Delta_1$ in a brute-force manner to within an additive
error of $\epsilon/9$. By Lemma 8, the error on the $i$th term of the sum is bounded by

\[
\frac{\epsilon}{26}\bigl(\max(p_i, b) + \max(q_i, b)\bigr) \le \frac{\epsilon}{26}\bigl(p_i + q_i + 2\epsilon b/13\bigr),
\]

where the last inequality follows from the fact that $p_i$ and $q_i$ are at least $(1 - \epsilon/13)b$. Consider the sum over
$i$ of these error terms. Notice that this sum is over at most $2/((1 - \epsilon/13)b)$ elements in $S$. Hence,
the total additive error is bounded by

\[
\sum_{i \in S} \frac{\epsilon}{26}\bigl(p_i + q_i + 2\epsilon b/13\bigr) \le \frac{\epsilon}{26}\bigl(2 + 4\epsilon/(13 - \epsilon)\bigr) \le \epsilon/9
\]

since $\epsilon \le 2$.

Note that $\max(\|p'\|_\infty, \|q'\|_\infty) < b + n^{-1} \le 2b$ for $\epsilon \ge 1/\sqrt{n}$. So, we can use the $\ell_2$-Distance-Test
on $p'$ and $q'$ with $m = O(\epsilon^{-8/3} n^{2/3})$ as shown by Theorem 7.

If $\|p - q\|_1 < \frac{\epsilon^{4/3}}{32\sqrt[3]{n}}$, then so are $\Delta_1$ and $\Delta_2$. The first phase of the algorithm clearly accepts.
Using the fact that, for any vector $v$, $\|v\|_2^2 \le \|v\|_1 \cdot \|v\|_\infty$, we get $\|p' - q'\|_2 \le \frac{\epsilon}{4\sqrt{n}}$. Therefore, the
$\ell_2$-Distance-Test accepts with probability at least $1 - \delta/2$. Similarly, if $\|p - q\|_1 > \epsilon$, then either
$\Delta_1 > \epsilon/4$ or $\Delta_2 > \epsilon/2$. Either the first phase of the algorithm or the $\ell_2$-Distance-Test will reject.

To see the running time bound, note that the time for the first phase is $O(n^{2/3} \epsilon^{-8/3} \log(n/\delta))$
and that the time for $\ell_2$-Distance-Test is $O(n^{2/3} \epsilon^{-8/3} \log(1/\delta))$. It is easy to see that our algorithm
makes an error either when it makes a bad estimation of $\Delta_1$ or when $\ell_2$-Distance-Test makes an
error. So, the probability of error is bounded by $\delta$. □
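The $\ell_2$ bound used in the accepting case can be verified in one line. The computation below is a sketch that takes $b = (\epsilon/n)^{2/3}$ as in the $\ell_1$-Distance-Test and uses $\|p' - q'\|_\infty \le \max(\|p'\|_\infty, \|q'\|_\infty) < 2b$, which holds because both vectors have nonnegative entries.

```latex
\begin{align*}
\|p' - q'\|_2^2
  &\le \|p' - q'\|_1 \cdot \|p' - q'\|_\infty
   \le \frac{\epsilon^{4/3}}{32\, n^{1/3}} \cdot 2b
   = \frac{\epsilon^{4/3}}{32\, n^{1/3}} \cdot \frac{2\,\epsilon^{2/3}}{n^{2/3}}
   = \frac{\epsilon^2}{16\, n}, \\
\|p' - q'\|_2 &\le \frac{\epsilon}{4\sqrt{n}} = \frac{1}{2} \cdot \frac{\epsilon}{2\sqrt{n}} .
\end{align*}
```

This is half of the closeness parameter $\epsilon/(2\sqrt{n})$ passed to the $\ell_2$-Distance-Test, which is exactly the accepting condition of Theorem 7.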

The next theorem improves this result by looking at the dependence of the variance calculation
in Section 2.1 on the norms of the distributions separately.

Theorem 10 Given two black-box distributions $p, q$ over $[n]$, with $\|p\|_\infty \le \|q\|_\infty$, there is a test
requiring $O((n^2 \|p\|_\infty \|q\|_\infty \epsilon^{-4} + n\sqrt{\|q\|_\infty}\,\epsilon^{-2}) \log(1/\delta))$ samples that (1) if $\|p - q\|_1 \le \cdots$, it
accepts with probability at least $1 - \delta$ and (2) if $\|p - q\|_1 > \epsilon$, it rejects with probability at least
$1 - \delta$.

2.3 Testing $\ell_1$ Distance from Uniformity

A special case of Theorem 2 gives a constant-time algorithm which provides an additive
approximation of the $\ell_2$ distance of a distribution from the uniform distribution. For the problem of
testing that $p$ is close to the uniform distribution in $\ell_1$ distance (i.e., testing closeness when $q$ is
the uniform distribution), one can get a better sample complexity dependence on $n$.

Theorem 11 Given $\epsilon \le 1$ and a black-box distribution $p$ over $[n]$, there is a test that takes $O(\epsilon^{-4} \cdot
\sqrt{n} \cdot \log(1/\delta))$ samples, accepts with probability at least $1 - \delta$ if $\|p - U_{[n]}\|_1 \le \epsilon/\sqrt{3n}$, and rejects
with probability at least $1 - \delta$ if $\|p - U_{[n]}\|_1 > \epsilon$.
The proof of Theorem 11 relies on the following lemma, which can be proven using techniques from
Goldreich and Ron [32] (see also Lemma 5 in this paper).

Lemma 12 Given a black-box distribution $p$ over $[n]$, there is an algorithm that takes $O(\epsilon^{-2} \cdot \sqrt{n} \cdot
\log(1/\delta))$ samples and estimates $\|p\|_2^2$ within an error of $\epsilon\|p\|_2^2$, with probability at least $1 - \delta$.

Proof: [of Lemma 12] Consider the random variable $r_p$ from the $\ell_2$-Distance-Test. Since $E[r_p] =
\binom{m}{2} \cdot \|p\|_2^2$, we only need to show that it does not deviate from its expectation too much with high
probability. Again, using Chebyshev's inequality and Lemma 5,

\[
\Pr\bigl[|r_p - E[r_p]| > \epsilon E[r_p]\bigr] \le \frac{O(m^2 \|p\|_2^2 + m^3 \|p\|_3^3)}{\epsilon^2 m^4 \|p\|_2^4} \le \frac{1}{4},
\]

where the last inequality follows for $m = O(\epsilon^{-2}\sqrt{n})$ from the fact that $\|p\|_2 \ge n^{-1/2}$. The
confidence can be boosted to $1 - \delta$ using $O(\log(1/\delta))$ repetitions. □

We note that, for an additive approximation of $\|p\|_2^2$, an analogous argument to the proof above
will yield an algorithm that uses $O(\epsilon^{-4})$ samples.

Proof: [of Theorem 11] The algorithm, given in Figure 3, estimates $\|p\|_2^2$ within $\epsilon^2\|p\|_2^2/5$
using the algorithm from Lemma 12 and accepts only if the estimate is below $(1 + 3\epsilon^2/5)/n$.

Uniformity-Distance-Test($p$, $m$, $\epsilon$, $\delta$)

1. Accept if GR-Uniformity-$\ell_2$-Distance-Test($p$, $\epsilon^2/5$) returns an estimate at most $(1 +
3\epsilon^2/5)/n$.

2. Otherwise, reject.

Figure 3: Algorithm Uniformity-Distance-Test

First, observe the following relationship between the $\ell_2$ distance to the uniform distribution and
the collision probability.

\[
\|p - U_{[n]}\|_2^2 = \sum_i \Bigl(p_i - \frac{1}{n}\Bigr)^2 = \sum_i p_i^2 - \frac{2}{n} \cdot \sum_i p_i + \frac{1}{n} = \|p\|_2^2 - \frac{1}{n} \qquad (1)
\]

If $\|p - U_{[n]}\|_1 \le \epsilon/\sqrt{3n}$, then $\|p - U_{[n]}\|_2^2 \le \epsilon^2/3n$. Using (1), we see that $\|p\|_2^2 \le (1 + \epsilon^2/3)/n$.
Hence, for $\epsilon \le 1$, the estimate will be below $(1 + \epsilon^2/5)(1 + \epsilon^2/3)/n \le (1 + 3\epsilon^2/5)/n$ with probability
at least $1 - \delta$.

Conversely, suppose the estimate of $\|p\|_2^2$ is below $(1 + 3\epsilon^2/5)/n$. By Lemma 12, $\|p\|_2^2 \le
(1 + 3\epsilon^2/5)/((1 - \epsilon^2/5)n) \le (1 + \epsilon^2)/n$ for $\epsilon \le 1$. Therefore, by (1), we can write

\[
\|p - U_{[n]}\|_2^2 = \|p\|_2^2 - \frac{1}{n} \le \epsilon^2/n.
\]

So, we have $\|p - U_{[n]}\|_2 \le \epsilon/\sqrt{n}$. Finally, by the relation between $\ell_1$ and $\ell_2$ norms, $\|p - U_{[n]}\|_1 \le \epsilon$.

The sample complexity of the procedure will be $O(\epsilon^{-4} \cdot \sqrt{n} \cdot \log(1/\delta))$, arising from the estimation
of $\|p\|_2^2$ within $\epsilon^2\|p\|_2^2/5$. □
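The uniformity test of Figure 3 reduces to estimating the self-collision probability and comparing it with $(1 + 3\epsilon^2/5)/n$. The sketch below is one possible rendering under stated assumptions: the estimate is the fraction of colliding pairs in a sample of size `m`, confidence is boosted by taking the median of several independent estimates (a standard choice that the paper does not spell out), and the sample size and repetition count are explicit arguments because the constants in the $O(\cdot)$ bounds are not given. All function names are hypothetical.

```python
import random
import statistics
from collections import Counter


def make_sampler(weights, rng):
    """Hypothetical helper: black-box access to a distribution on {0, ..., n-1}."""
    population = range(len(weights))
    return lambda k: rng.choices(population, weights=weights, k=k)


def self_collisions(samples):
    """Number of pairs i < j whose samples coincide."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())


def estimate_collision_probability(sampler, m, repetitions):
    """Median of unbiased estimates of ||p||_2^2, each from m fresh samples."""
    pairs = m * (m - 1) // 2
    return statistics.median(
        self_collisions(sampler(m)) / pairs for _ in range(repetitions)
    )


def uniformity_distance_test(sampler, n, eps, m, repetitions):
    """Accept (True) iff the estimated ||p||_2^2 is at most (1 + 3 eps^2 / 5) / n."""
    estimate = estimate_collision_probability(sampler, m, repetitions)
    return estimate <= (1 + 3 * eps ** 2 / 5) / n
```

On a 100-element domain with `eps = 0.5` and `m = 5000`, the uniform distribution has $\|p\|_2^2 = 0.01$, well below the threshold $0.0115$, while the uniform distribution on half of the domain has $\|p\|_2^2 = 0.02$ and $\ell_1$ distance 1 from uniform, so it is rejected.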
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3 Lower Bounding the Sample Complexity 



In this section we consider lower bounds on the sample complexity of testing closeness of distributions.
In a previous version of this paper [9], we claimed an almost matching $\Omega(n^{2/3})$ lower bound
on the sample complexity for testing the closeness of two arbitrary distributions. Although it was
later determined that there were gaps in the proofs, recent results of [33] have shown that the
almost matching lower bounds do in fact hold. Although new proof techniques were needed,
certain technical ideas such as "Poissonization" and the characterization of "canonical forms of
testing algorithms" that first appeared in the earlier version of this work did turn out to be
useful in the correct lower bound proof of [33]. We will outline those ideas in this section.

We begin by discussing a characterization of canonical algorithms for testing properties of
distributions. Then we describe a pair of families of distributions that were suggested in the earlier
version of this work, and were in fact used by Valiant [31] in showing the correct lower bound. Next,
we investigate the required dependence on $\epsilon$. Finally, we briefly consider naive learning algorithms,
which can be defined as algorithms that, given samples from a distribution, output a distribution
with small distance to the input distribution. We show that naive learning algorithms require
$\Omega(n)$ samples. We also note that, more recently, the dependency of testing uniformity on the distance
parameter $\epsilon$ and $n$ has been tightly characterized to be $\Theta(\sqrt{n}/\epsilon^2)$ by Paninski [45].

3.1 Characterization of Canonical Algorithms for Testing Properties of Distributions

In this section, we characterize canonical algorithms for testing properties of distributions defined 
by permutation-invariant functions. The argument hinges on the irrelevance of the labels of the 
domain elements for such a function. We obtain this canonical form in two steps, corresponding to 
the two lemmas below. The first step makes explicit the intuition that such an algorithm should 
be symmetric, that is, the algorithm would not benefit from discriminating among the labels. In 
the second step, we remove the use of labels altogether, and show that we can present the sample 
to the algorithm in an aggregate fashion. Raskhodnikova et al. [H] use this characterization of
canonical algorithms for proving lower bounds on the sample complexity of distribution support 
size and element distinctness problems. 

Characterizations of property testing algorithms have been studied in other settings. For example,
using similar techniques, Alon et al. [2] show a canonical form for algorithms for testing
graph properties. Later, Goldreich and Trevisan [29] formally prove the result by Alon et al. In
a different setting, Bar-Yossef et al. [6] show a canonical form for sampling algorithms that approximate
symmetric functions of the form $f : A^n \to B$ where $A$ and $B$ are arbitrary sets. In the
latter setting, the algorithm is given oracle access to the input vector and takes samples from the
coordinate values of this vector.

Next, we give the definitions of basic concepts on which we build a characterization of canonical 
algorithms for testing properties of distributions. Then, we describe and prove our characterization. 

Definition 13 (Permutation of a distribution) For a distribution $p$ over $[n]$ and a permutation
$\pi$ on $[n]$, define $\pi(p)$ to be the distribution such that for all $i$, $\pi(p)_{\pi(i)} = p_i$.

Definition 14 (Symmetric Algorithm) Let A be an algorithm that takes samples from k dis- 
crete black-box distributions over [n] as input. We say that A is symmetric if, once the distributions 
are fixed, the output distribution of A is identical for any permutation of the distributions. 
Definition 15 (Permutation-invariant function) A k-ary function $f$ on distributions over $[n]$
is permutation-invariant if for any permutation $\pi$ on $[n]$, and all distributions $(p^{(1)}, \ldots, p^{(k)})$,

$$f(p^{(1)}, \ldots, p^{(k)}) = f(\pi(p^{(1)}), \ldots, \pi(p^{(k)})).$$

Lemma 16 Let A be an arbitrary testing algorithm for a k-ary property $\mathcal{P}$ defined by a
permutation-invariant function. Suppose A has sample complexity $s(n)$, where $n$ is the domain size of the
distributions. Then, there exists a symmetric algorithm that tests the same property of distributions
with sample complexity $s(n)$.

Proof: Given the algorithm A, construct a symmetric algorithm A' as follows: Choose a random 
permutation of the domain elements. Upon taking s(n) samples, apply this permutation to each 
sample. Pass this (renamed) sample set to A and output according to A. 

It is clear that the sample complexity of the algorithm does not change. We need to show that
the new algorithm also maintains the testing features of A. Suppose that the input distributions
$(p^{(1)}, \ldots, p^{(k)})$ have the property $\mathcal{P}$. Since the property is defined by a permutation-invariant
function, any permutation of the distributions maintains this property. Therefore, the permutation
of the distributions should be accepted as well. Let $S_n$ denote the set of all permutations on $[n]$.
Then,

$$\Pr[A' \text{ accepts } (p^{(1)}, \ldots, p^{(k)})] = \sum_{\pi \in S_n} \frac{1}{n!} \Pr[A \text{ accepts } (\pi(p^{(1)}), \ldots, \pi(p^{(k)}))],$$

which is at least 2/3 by the accepting probability of A.

An analogous argument on the failure probability for the case of the distributions $(p^{(1)}, \ldots, p^{(k)})$
that should be rejected completes the proof. □

In order to avoid introducing additional randomness in A', we can try A on all possible permu- 
tations and output the majority vote. This change would not affect the sample complexity, and it 
can be shown that it maintains correctness. 
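The construction in the proof of Lemma 16 can be sketched as a wrapper around an arbitrary tester. This is a minimal illustration under stated assumptions: the domain is taken to be {0, ..., n-1}, a tester is any Python callable on lists of samples, and the name `symmetrize` is hypothetical.

```python
import random

def symmetrize(tester, n, rng=random):
    """Return A': relabel all samples by one random permutation, then run A."""
    def symmetric_tester(*sample_sets):
        perm = list(range(n))
        rng.shuffle(perm)  # uniformly random permutation of the domain
        # The same permutation is applied to every sample set.
        renamed = [[perm[x] for x in s] for s in sample_sets]
        return tester(*renamed)
    return symmetric_tester

# A label-dependent tester: accepts iff the label 0 occurs in the first sample.
def sees_zero(s1):
    return 0 in s1

rng = random.Random(1)
sym = symmetrize(sees_zero, 8, rng)
trials = 8000
hits_a = sum(sym([3]) for _ in range(trials))  # input uses label 3
hits_b = sum(sym([0]) for _ in range(trials))  # input uses label 0
print(hits_a / trials, hits_b / trials)
```

After symmetrization both inputs are accepted with the same probability (here 1/8 in expectation), although the original tester treats them differently.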

Definition 17 (Fingerprint of a sample) Let $S_1$ and $S_2$ be multisets of at most $s$ samples taken
from two black-box distributions over $[n]$, $p$ and $q$, respectively. Let the random variable $C_{ij}$, for
$0 \le i, j \le s$, denote the number of elements that appear exactly $i$ times in $S_1$ and exactly $j$ times
in $S_2$. The collection of values that the random variables $\{C_{ij}\}_{0 \le i,j \le s}$ take is called the fingerprint
of the sample.

For example, let the sample sets be $S_1 = \{5, 7, 3, 3, 4\}$ and $S_2 = \{2, 4, 3, 2, 6\}$. Then, $C_{10} = 2$
(elements 5 and 7), $C_{01} = 1$ (element 6), $C_{11} = 1$ (element 4), $C_{02} = 1$ (element 2), $C_{21} = 1$
(element 3), and for the remaining $i, j$'s, $C_{ij} = 0$.
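The example above can be reproduced mechanically. The sketch below is illustrative only; the function name `fingerprint` is hypothetical, and only elements that occur in at least one of the two samples are counted (so no entry is reported for elements seen in neither sample).

```python
from collections import Counter

def fingerprint(s1, s2):
    """Map (i, j) -> number of elements seen exactly i times in s1 and j times in s2."""
    c1, c2 = Counter(s1), Counter(s2)
    fp = Counter()
    for x in set(c1) | set(c2):
        fp[(c1[x], c2[x])] += 1  # a Counter returns 0 for elements it has not seen
    return dict(fp)

fp = fingerprint([5, 7, 3, 3, 4], [2, 4, 3, 2, 6])
# fp == {(1, 0): 2, (0, 1): 1, (1, 1): 1, (0, 2): 1, (2, 1): 1}
print(fp)
```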

Lemma 18 If there exists a symmetric algorithm A for testing a binary property of distributions
defined by a permutation-invariant function, then there exists an algorithm for the same task that
gets as input only the fingerprint of the sample that A takes.

Proof: Fix a canonical order for the $C_{ij}$'s in the fingerprint of a sample. Let us define the following
transformation on the sample: Relabel the elements such that the elements that appear exactly
the same number of times from each distribution (i.e., the ones that contribute to a single $C_{ij}$ in
the fingerprint) have consecutive labels and the labels are grouped to conform to the canonical
order of the $C_{ij}$'s. Let us call this transformed sample the standard form of the sample. Since the
algorithm A is symmetric and the property is defined by a permutation-invariant function, such a
transformation does not affect the output of A. So, we can further assume that we always present
the sample to the algorithm in the standard form.

It is clear that given a sample, we can easily write down the fingerprint of the sample. Moreover,
given the fingerprint of a sample, we can always construct a sample $(S_1, S_2)$ in the standard form
using the following algorithm: (1) Initialize $S_1$ and $S_2$ to be empty, and $e = 1$, (2) for every $C_{ij}$
in the canonical order, and for $C_{ij} = k_{ij}$ times, include $i$ and $j$ copies of the element $e$ in $S_1$ and
$S_2$, respectively, then increment $e$. This algorithm shows a one-to-one and onto correspondence
between all possible sample sets in the standard form and all possible $\{C_{ij}\}_{0 \le i,j \le s}$ values.

Consider the algorithm A' that takes the fingerprint of a sample as input. Using the
algorithm above, A' constructs the sample in the standard form. Finally, A' outputs
what A outputs on this sample. □
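The reconstruction step in the proof can be sketched as follows. This is an illustration under assumptions: lexicographic order on $(i, j)$ is used as the canonical order (the proof only requires that some order be fixed), and the function names are hypothetical.

```python
from collections import Counter

def fingerprint(s1, s2):
    """Map (i, j) -> number of elements seen exactly i times in s1 and j times in s2."""
    c1, c2 = Counter(s1), Counter(s2)
    fp = Counter()
    for x in set(c1) | set(c2):
        fp[(c1[x], c2[x])] += 1
    return dict(fp)

def standard_form(fp):
    """Rebuild a sample (S1, S2) in standard form from a fingerprint."""
    s1, s2, e = [], [], 1
    for (i, j) in sorted(fp):          # assumed canonical order: lexicographic
        for _ in range(fp[(i, j)]):    # one fresh label per counted element
            s1.extend([e] * i)
            s2.extend([e] * j)
            e += 1
    return s1, s2

fp = fingerprint([5, 7, 3, 3, 4], [2, 4, 3, 2, 6])
s1, s2 = standard_form(fp)
# s1 == [3, 4, 5, 6, 6] and s2 == [1, 2, 2, 5, 6]
print(s1, s2)
```

Taking the fingerprint of the rebuilt sample returns the original fingerprint, which is the round trip the proof relies on.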

Remark 19 Note that the definition of the fingerprint from Definition 17 can be generalized for a
collection of k sample sets from k distributions for any k. An analogous lemma to Lemma 18 can be
proven for testing algorithms for k-ary properties of distributions defined by a permutation-invariant
function. We fixed k = 2 for ease of notation.

3.2 Towards a Lower Bound on the Sample Complexity of Testing Closeness 

In this section, we present techniques that were later used by Valiant [54] to prove a lower bound on
the sample complexity of testing closeness in $\ell_1$ distance as a function of the size $n$ of the domain
of the distributions. We give a high-level description of the proof, and indicate where our reasoning
breaks down and where Valiant [54] comes in.

Theorem 20 ([54]) Given any algorithm using only $o(n^{2/3})$ samples from two discrete black-box
distributions over $[n]$, for all sufficiently large $n$, there exist distributions $p$ and $q$ with $\ell_1$ distance
1 such that the algorithm will be unable to distinguish the case where one distribution is $p$ and the
other is $q$ from the case where both distributions are $p$.

By Lemma 16, we may restrict our attention to symmetric algorithms. Fix a testing algorithm
A that uses $o(n^{2/3})$ samples from each of the input distributions.

Let us assume, without loss of generality, that $n$ is a multiple of four and $n^{2/3}$ is an integer. We
define the distributions $p$ and $q$ as follows: (1) For $1 \le i \le n^{2/3}$, $p_i = q_i = \frac{1}{2n^{2/3}}$. We call these
elements the heavy elements. (2) For $n/2 < i \le 3n/4$, $p_i = \frac{2}{n}$ and $q_i = 0$. We call these elements
the light elements of $p$. (3) For $3n/4 < i \le n$, $q_i = \frac{2}{n}$ and $p_i = 0$. We call these elements the light
elements of $q$. (4) For the remaining $i$'s, $p_i = q_i = 0$. Note that these distributions do not depend
on A.

The $\ell_1$ distance of $p$ and $q$ is 1. Now, consider the following two cases:

Case 1: The algorithm is given access to two black-box distributions: both of which output 
samples according to the distribution p. 

Case 2: The algorithm is given access to two black-box distributions: the first one out- 
puts samples according to the distribution p and the second one outputs samples 
according to the distribution q. 
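The pair $(p, q)$ can be written out explicitly for a small domain to check the normalization and the distance. The sketch below is illustrative; it uses 0-indexed elements, exact rational arithmetic, and a hypothetical function name, and it takes the heavy-set size $n^{2/3}$ as an explicit argument.

```python
from fractions import Fraction

def hard_pair(n, heavy):
    """Build p and q; `heavy` plays the role of n^(2/3) and must be at most n // 2."""
    p = [Fraction(0)] * n
    q = [Fraction(0)] * n
    for i in range(heavy):                 # heavy elements, shared by p and q
        p[i] = q[i] = Fraction(1, 2 * heavy)
    for i in range(n // 2, 3 * n // 4):    # light elements of p
        p[i] = Fraction(2, n)
    for i in range(3 * n // 4, n):         # light elements of q
        q[i] = Fraction(2, n)
    return p, q

p, q = hard_pair(n=64, heavy=16)           # 64^(2/3) = 16
l1_distance = sum(abs(a - b) for a, b in zip(p, q))
print(sum(p), sum(q), l1_distance)         # each equals 1
```

Each distribution puts mass 1/2 on the heavy elements and 1/2 on its own light elements, so the two differ exactly on the light halves.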
To get a sense of why these distributions should be hard for any distance testing algorithm, note
that when restricted to the heavy elements, both distributions are identical. The only difference 
between p and q comes from the light elements, and the crux of the proof is to show that this differ- 
ence will not change the relevant statistics in a statistically significant way. For example, consider 
the statistic which counts the number of elements that occur exactly once from each distribution. 
One would like to show that this statistic has a very similar distribution when generated by Case 1 
and Case 2, because the expected number of such elements that are light is much less than the 
standard deviation of the number of such elements that are heavy. 

Our initial attempts at formalizing the intuition above were incomplete. However, completely
formalizing this intuition, Valiant [54] subsequently showed that a symmetric algorithm with sample
complexity $o(n^{2/3})$ cannot distinguish between these two cases. By Lemma 16, the theorem follows.

Poissonization For simplifying the proof, it would be useful to have the frequency of each element
be independent of the frequencies of the other elements. To achieve this, we assume that algorithm
A first chooses two integers $s_1$ and $s_2$ independently from a Poisson distribution with the parameter
$\lambda = s = o(n^{2/3})$. The Poisson distribution with the positive parameter $\lambda$ has the probability mass
function $p(k) = \exp(-\lambda)\lambda^k/k!$. Then, after taking $s_1$ samples from the first distribution and $s_2$
samples from the second distribution, A decides whether to accept or reject the distributions. In
the following, we give an overview of the proof that A cannot distinguish between Case 1 and Case 2
with success probability at least 2/3. Since both $s_1$ and $s_2$ will have values larger than $s/2$ with
probability at least $1 - o(1)$ and the statistical distance of the distributions of two random variables
(i.e., the distributions on the samples) is bounded, it will follow that no symmetric algorithm with
sample complexity $s/2$ can.

Let $F_i$ be the random variable corresponding to the number of times the element $i$ appears in
the sample from the first distribution. Define $G_i$ analogously for the second distribution. It is well
known that $F_i$ is distributed identically to the Poisson distribution with parameter $\lambda = sr$, where
$r$ is the probability of element $i$ (cf. Feller [25], p. 216). Furthermore, it can also be shown that
all $F_i$'s are mutually independent. Thus, the total number of samples from the heavy elements and
the total number of samples from the light elements are independent.
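The effect of Poissonization can be observed empirically. The following sketch is an illustration, not part of the proof: it draws a Poisson number of samples (using Knuth's multiplication method, which is adequate only for small parameters) and estimates the mean of $F_0$ and the covariance of $F_0$ and $F_1$. All names are hypothetical.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's method: count uniform draws until their product drops to exp(-lam)."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def poissonized_counts(p, s, rng):
    """Draw Poisson(s) samples from p and return the frequency F_i of each element."""
    total = poisson(s, rng)
    counts = [0] * len(p)
    for x in rng.choices(range(len(p)), weights=p, k=total):
        counts[x] += 1
    return counts

rng = random.Random(0)
p, s, trials = [0.5, 0.3, 0.2], 10, 3000
runs = [poissonized_counts(p, s, rng) for _ in range(trials)]
mean0 = sum(r[0] for r in runs) / trials
mean1 = sum(r[1] for r in runs) / trials
cov01 = sum(r[0] * r[1] for r in runs) / trials - mean0 * mean1
print(mean0, cov01)  # mean0 should be near s * p[0] = 5, cov01 near 0
```

With a fixed sample size $s$ the frequencies would be multinomial, with covariance $-s\,p_0 p_1 = -1.5$ in this example; under Poissonization the empirical covariance is close to zero, as independence predicts.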

Canonical Testing Algorithms Recall the definition of the fingerprint of a sample from
Section 3.1. The random variable $C_{ij}$ denotes the number of elements that appear exactly $i$ times in
the sample from the first distribution and exactly $j$ times in the sample from the second distribution.
We can then assume that the algorithm is only given the fingerprint of the sample, and apply
Lemma 18.

Arguing in this way can lead to several subtle pitfalls, which Valiant's proof [54] circumvents by 
developing a body of additional, very nontrivial, technical machinery to show that the distributions 
on the fingerprint when the samples come from Case 1 or Case 2 are indistinguishable. 

3.3 Other Lower Bounds 

In this section, we first give two lower bounds for the sample complexity of testing closeness in
terms of the distance parameter $\epsilon$. Then, we show that a naive learning algorithm for distributions
requires $\Omega(n)$ samples.

By appropriately modifying the distributions $p$ and $q$ from the proof, we can give a stronger
version of Theorem 20 with a dependence on $\epsilon$.
Corollary 21 Given any test using only $o(n^{2/3}/\epsilon^{2/3})$ samples, there exist distributions $a$ and $b$ of
$\ell_1$ distance $\epsilon$ such that the test will be unable to distinguish the case where one distribution is $a$ and
the other is $b$ from the case where both distributions are $a$.

We can get a lower bound of $\Omega(\epsilon^{-2})$ for testing the $\ell_2$ distance with a rather simple proof.

Theorem 22 Given any test using only $o(\epsilon^{-2})$ samples, there exist distributions $a$ and $b$ of $\ell_2$
distance $\epsilon$ such that the test will be unable to distinguish the case where one distribution is $a$ and
the other is $b$ from the case where both distributions are $a$.

Proof: Let $n = 2$, $a_1 = a_2 = 1/2$ and $b_1 = 1/2 - \epsilon/\sqrt{2}$ and $b_2 = 1/2 + \epsilon/\sqrt{2}$. Distinguishing
these distributions is exactly the question of distinguishing a fair coin from a coin of bias $\Theta(\epsilon)$,
which is well known to require $\Theta(\epsilon^{-2})$ coin flips. □
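As a quick check of the construction (a worked computation, not part of the original proof), the two coins are at $\ell_2$ distance exactly $\epsilon$, and $b$ is a valid distribution whenever $\epsilon \le 1/\sqrt{2}$:

```latex
\[
\|a - b\|_2
  = \sqrt{\left(\tfrac{\epsilon}{\sqrt{2}}\right)^{2} + \left(\tfrac{\epsilon}{\sqrt{2}}\right)^{2}}
  = \sqrt{\tfrac{\epsilon^{2}}{2} + \tfrac{\epsilon^{2}}{2}}
  = \epsilon .
\]
```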

The next theorem shows that learning a distribution using a sublinear number of samples is not
possible.

Theorem 23 Suppose we have an algorithm that draws $o(n)$ samples from some unknown distribution
$b$ and outputs a distribution $c$. There is some distribution $b$ for which the output $c$ is such
that $b$ and $c$ have $\ell_1$ distance close to one.

Proof: (Sketch) Let $A_S$ be the distribution that is uniform over $S \subseteq \{1, \ldots, n\}$. Pick $S$
at random among sets of size $n/2$ and run the algorithm on $A_S$. The algorithm only learns $o(n)$
elements from $S$. So, with high probability, whatever distribution the algorithm
outputs will have $\ell_1$ distance from $A_S$ of nearly one. □

4 Applications to Markov Chains 

Random walks on Markov chains generate probability distributions over the states of the chain,
induced by the endpoints of the random walks. We employ $\ell_1$-Distance-Test, described in Section 2,
to test mixing properties of Markov chains.

This application of $\ell_1$-Distance-Test is initially inspired by the work of Goldreich and Ron [32],
which conjectured an algorithm for testing expansion of bounded-degree graphs. Their algorithm
is based on comparing the distribution of the endpoints of random walks on a graph to the uniform
distribution via collisions. Subsequently to this work, Czumaj and Sohler [22], Kale and
Seshadri [37], and Nachmias and Shapira [43] have independently concluded that the algorithm of
Goldreich and Ron is provably a test for the expansion property of graphs.

4.1 Preliminaries and Notation 

Let M be a Markov chain represented by the transition probability matrix M. The point distribution
on the $u$th state of M corresponds to an $n$-vector $e_u = (0, \ldots, 1, \ldots, 0)$, with a one in only the $u$th
location and zeroes elsewhere. The distribution generated by $t$-step random walks starting at state
$u$ is denoted as a vector-matrix product $e_u M^t$.

Instead of computing such products in our algorithms, we assume that our $\ell_1$-Distance-Test
has access to an oracle, next_node, which on input of the state $u$ responds with the state $v$ with
probability $M(u, v)$. Given such an oracle, the distribution $e_u M^t$ can be generated in $O(t)$ steps.
Furthermore, the oracle itself can be realized in $O(\log n)$ time per query, given linear preprocessing
time to compute the cumulative sums $M_c(j, k) = \sum_{i=1}^{k} M(j, i)$. The oracle can be simulated on
input $u$ by producing a random number $\alpha$ in $[0, 1]$ and performing binary search over the $u$th row
of $M_c$ to find $v$ such that $M_c(u, v-1) < \alpha \le M_c(u, v)$, with the convention $M_c(u, 0) = 0$, so that
$v$ is selected with probability $M(u, v)$. It then outputs state $v$. Note that
when M is such that every row has at most $d$ nonzero terms, slight modifications of this yield an
$O(\log d)$ implementation consuming $O(n + m)$ words of memory if M is $n \times n$ and has $m$ nonzero
entries. Improvements of the work given in [55] can be used to prove that in fact constant query
time is achievable with space consumption $O(n + m)$ for implementing next_node, given linear
preprocessing time.
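The binary-search realization of next_node can be sketched with the standard library. This is a minimal illustration for a dense row-stochastic matrix given as a list of lists, with 0-indexed states; the helper names `build_cumulative` and `walk` are hypothetical, and the sparse and constant-time variants mentioned above are not covered.

```python
import bisect
import itertools
import random

def build_cumulative(M):
    """Row-wise prefix sums: Mc[u][k] = M[u][0] + ... + M[u][k]."""
    return [list(itertools.accumulate(row)) for row in M]

def next_node(Mc, u, rng=random):
    """Return state v with probability M[u][v], using binary search on row u."""
    alpha = rng.random() * Mc[u][-1]       # scaling absorbs rounding in the row total
    v = bisect.bisect_right(Mc[u], alpha)  # first index whose prefix sum exceeds alpha
    return min(v, len(Mc[u]) - 1)          # guard against alpha landing on the total

def walk(Mc, u, t, rng=random):
    """Endpoint of a t-step random walk from u, i.e. one sample from e_u M^t."""
    for _ in range(t):
        u = next_node(Mc, u, rng)
    return u

M = [[0.25, 0.75],
     [1.0, 0.0]]
Mc = build_cumulative(M)
rng = random.Random(0)
draws = [next_node(Mc, 0, rng) for _ in range(4000)]
print(sum(draws) / len(draws))  # empirical frequency of state 1, expected near 0.75
```

Entries with zero probability are never returned, because their prefix sum equals that of the preceding entry and the search skips over them.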

We define a notion of closeness between states u and v, based on the distributions of endpoints 
of t step random walks starting at u and v respectively. 

Definition 24 We say that two states $u$ and $v$ are $(\epsilon, t)$-close if the distributions generated by $t$-step
random walks starting at $u$ and $v$ are within $\epsilon$ in the $\ell_1$ norm, i.e., $\|e_u M^t - e_v M^t\|_1 \le \epsilon$. Similarly,
we say that a state $u$ and a distribution $s$ are $(\epsilon, t)$-close if $\|e_u M^t - s\|_1 \le \epsilon$.

We say M is $(\epsilon, t)$-mixing if all states are $(\epsilon, t)$-close to the same distribution:

Definition 25 A Markov chain M is $(\epsilon, t)$-mixing if a distribution $s$ exists such that for all states
$u$, $\|e_u M^t - s\|_1 \le \epsilon$.

For example, if M is $(\epsilon, O(\log n \log 1/\epsilon))$-mixing, then M is rapidly-mixing [53]. It can be easily
seen that if M is $(\epsilon, t_0)$-mixing then it is $(\epsilon, t)$-mixing for all $t > t_0$.
We now make the following definition:

Definition 26 The average $t$-step distribution, $s_{M,t}$, of a Markov chain M with $n$ states is the
distribution

$$s_{M,t} = \frac{1}{n} \sum_u e_u M^t.$$

This distribution can be easily generated by picking $u$ uniformly from $[n]$ and walking $t$ steps from
state $u$. In an $(\epsilon, t)$-mixing Markov chain, the average $t$-step distribution is $\epsilon$-close to the stationary
distribution. In a Markov chain that is not $(\epsilon, t)$-mixing, this is not necessarily the case.
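Definitions 24 to 26 can be checked exactly on a small chain. The sketch below is an illustration only: it multiplies out $e_u M^t$ explicitly (which the testers in this section avoid by sampling), states are 0-indexed, and the helper names are hypothetical.

```python
def t_step_distribution(M, u, t):
    """Exact e_u M^t, computed by t vector-matrix products."""
    n = len(M)
    dist = [1.0 if v == u else 0.0 for v in range(n)]
    for _ in range(t):
        dist = [sum(dist[w] * M[w][v] for w in range(n)) for v in range(n)]
    return dist

def average_t_step(M, t):
    """s_{M,t} = (1/n) * sum over u of e_u M^t."""
    n = len(M)
    rows = [t_step_distribution(M, u, t) for u in range(n)]
    return [sum(row[v] for row in rows) / n for v in range(n)]

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def max_distance_to_average(M, t):
    """Largest l1 distance from any e_u M^t to s_{M,t}."""
    s = average_t_step(M, t)
    return max(l1(t_step_distribution(M, u, t), s) for u in range(len(M)))

# Lazy random walk on the 4-cycle: stay with probability 1/2, else move to a neighbor.
lazy_cycle = [[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]]
print(max_distance_to_average(lazy_cycle, 1))
print(max_distance_to_average(lazy_cycle, 20))
```

If the returned maximum is at most $\epsilon$, the chain is $(\epsilon, t)$-mixing with $s = s_{M,t}$ as the witness distribution in Definition 25.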

Each test given below assumes access to an $\ell_1$ distance tester $\ell_1$-Distance-Test($u$, $v$, $\epsilon$, $\delta$) which,
given oracle access to distributions $e_u$, $e_v$ over the same $n$-element set, decides whether $\|e_u - e_v\|_1 \le
f(\epsilon)$ or $\|e_u - e_v\|_1 > \epsilon$ with confidence $1 - \delta$. The time complexity of $\ell_1$-Distance-Test is $T(n, \epsilon, \delta)$, and
$f$ is the gap of the tester. The implementation of $\ell_1$-Distance-Test given earlier in Section 2 has
gap $f(\epsilon) = \epsilon/(4\sqrt{n})$, and time complexity $T = O(\epsilon^{-8/3} n^{2/3} \log\frac{n}{\delta})$.

4.2 A Test for Mixing and a Test for Almost-Mixing 

We show how to decide if a Markov chain is $(\epsilon, t)$-mixing; then, we define and solve a natural
relaxation of that problem.

In order to test whether M is $(\epsilon, t)$-mixing, one can use $\ell_1$-Distance-Test to compare each
distribution $e_u M^t$ with $s_{M,t}$, with error parameter $\epsilon$ and confidence $\delta/n$. The running time is
$O(nt \cdot T(n, \epsilon, \delta/n))$. The algorithm is given in Figure 4.

Mixing(M, $t$, $\epsilon$, $\delta$)

1. For each state $u$ in M
   Reject if $\ell_1$-Distance-Test($e_u M^t$, $s_{M,t}$, $\epsilon$, $\delta/n$) rejects.

2. Otherwise, accept.

Figure 4: Algorithm Mixing

The behavior of the test is as follows: If every state is $(f(\epsilon)/2, t)$-close to some distribution $s$,
then $s_{M,t}$ is $f(\epsilon)/2$-close to $s$. Therefore, by the triangle inequality, every state is $(f(\epsilon), t)$-close
to $s_{M,t}$ and the tester passes. On the other hand, if there is no distribution that is $(\epsilon, t)$-close to
all states, then, in particular, $s_{M,t}$ is not $(\epsilon, t)$-close to at least one state and so the tester fails.
Thus, we have shown the following theorem.

Theorem 27 Let M be a Markov chain. Given $\ell_1$-Distance-Test with time complexity $T(n, \epsilon, \delta)$
and gap $f$ and an oracle for next_node, there exists a test with time complexity $O(nt \cdot T(n, \epsilon, \delta/n))$
with the following behavior: If M is $(f(\epsilon)/2, t)$-mixing then $\Pr[\text{M is accepted}] \ge 1 - \delta$; if M is not
$(\epsilon, t)$-mixing then $\Pr[\text{M is accepted}] \le \delta$.

For the implementation of $\ell_1$-Distance-Test given in Section 2, the running time of the Mixing
algorithm is $O(\epsilon^{-8/3} n^{5/3} t \log\frac{n}{\delta})$. It distinguishes between chains which are $\epsilon/(4\sqrt{n})$-mixing and
those which are not $\epsilon$-mixing. The running time is sublinear in the size of M if $t \in o(n^{1/3}/\log(n))$.
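The control flow of Figure 4 is short enough to sketch directly. In the sketch below, `distance_test(u, eps, conf)` is a hypothetical stand-in for a call to $\ell_1$-Distance-Test on $e_u M^t$ and $s_{M,t}$; it is an assumption of this illustration, not an interface defined in the text.

```python
def mixing(n, distance_test, eps, delta):
    """Accept iff the distance test passes for every state.

    Each call is run with confidence delta / n, so that by a union bound
    the n calls are all correct with probability at least 1 - delta.
    """
    for u in range(n):
        if not distance_test(u, eps, delta / n):
            return False   # some e_u M^t was reported far from s_{M,t}
    return True

# Toy stand-in that records its calls and reports state 2 as far.
calls = []
def toy_test(u, eps, conf):
    calls.append((u, conf))
    return u != 2

print(mixing(4, toy_test, eps=0.1, delta=0.2))   # False: state 2 fails
print(calls)                                      # stops after the first failure
```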

A relaxation of this procedure is testing that most starting states reach the same distribution
after $t$ steps. If a $(1 - \rho)$ fraction of the states $u$ of a given M satisfy $\|s - e_u M^t\|_1 \le \epsilon$, then we
say that M is $(\rho, \epsilon, t)$-almost mixing. The algorithm in Figure 5 tests whether a Markov chain is
$(\rho, \epsilon, t)$-almost mixing.

AlmostMixing(M, $t$, $\epsilon$, $\delta$, $\rho$)

Repeat $O(1/\rho \cdot \ln(1/\delta))$ times

1. Pick a state $u$ in M uniformly at random.

2. Reject if $\ell_1$-Distance-Test($e_u M^t$, $s_{M,t}$, $\epsilon$, $\delta\rho$) rejects.

Accept if none of the tests above rejected.

Figure 5: Algorithm AlmostMixing

Thus, we obtain the following theorem. 

Theorem 28 Let M be a Markov chain. Given $\ell_1$-Distance-Test with time complexity $T(n, \epsilon, \delta)$
and gap $f$ and an oracle for next_node, there exists a test with time complexity $O(\frac{t}{\rho} T(n, \epsilon, \delta\rho) \log\frac{1}{\delta})$
with the following behavior: If M is $(\rho, f(\epsilon)/2, t)$-almost mixing then $\Pr[\text{M is accepted}] \ge 1 - \delta$; if
M is not $(\rho, \epsilon, t)$-almost mixing then $\Pr[\text{M is accepted}] \le \delta$.
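The sampling loop of Figure 5 can be sketched in the same style. Here `is_close(u)` is a hypothetical stand-in for $\ell_1$-Distance-Test($e_u M^t$, $s_{M,t}$, $\epsilon$, $\delta\rho$), and the constant `c` hidden by the $O(\cdot)$ in the number of repetitions is an assumption of this illustration.

```python
import math
import random

def almost_mixing(n, is_close, rho, delta, rng=random, c=1.0):
    """Sample O((1/rho) ln(1/delta)) random states; reject if any is reported far."""
    rounds = math.ceil(c / rho * math.log(1 / delta))
    for _ in range(rounds):
        u = rng.randrange(n)       # state chosen uniformly at random
        if not is_close(u):
            return False
    return True

rng = random.Random(0)
n = 100
all_good = almost_mixing(n, lambda u: True, rho=0.1, delta=0.01, rng=rng)
# Half of the states are far: each round misses them with probability 1/2.
half_bad = almost_mixing(n, lambda u: u >= n // 2, rho=0.1, delta=0.01, rng=rng)
print(all_good, half_bad)
```

With $\rho = 0.1$ and $\delta = 0.01$ the loop runs 47 times for `c = 1`, so a chain in which half of the states are far escapes detection with probability $2^{-47}$.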

4.3 A Property Tester for Mixing 

The main result of this section is a test that determines if a Markov chain's matrix representation
can be changed in an $\epsilon$ fraction of the non-zero entries to turn it into a $(4\epsilon, 2t)$-mixing Markov
chain. This notion falls within the scope of property testing [501 SOI EH [231 EJ, which in general
takes a set $S$ with distance function $\Delta$ and a subset $P \subseteq S$ and decides if an element $x \in S$ is in
$P$ or if it is far from every element in $P$, according to $\Delta$. For the Markov chain problem, we take
as our set $S$ all matrices M of size $n \times n$ with at most $d$ non-zero entries in each row. The distance
function is given by the fraction of non-zero entries in which two matrices differ, and the difference
in their average $t$-step distributions.

4.3.1 Preliminaries 

We start with defining a distance function on a pair of Markov chains on the same state space. 

Definition 29 Let $M_1$ and $M_2$ be $n$-state Markov chains with at most $d$ non-zero entries in each
row. Define distance function $\Delta(M_1, M_2) = (\epsilon_1, \epsilon_2)$ if and only if $M_1$ and $M_2$ differ on $\epsilon_1 dn$ entries
and $\|s_{M_1,t} - s_{M_2,t}\|_1 = \epsilon_2$. We say that $M_1$ and $M_2$ are $(\epsilon_1, \epsilon_2)$-close if $\Delta(M_1, M_2) \le (\epsilon_1, \epsilon_2)$ (see footnote 2).

A natural question is whether all Markov chains are $\epsilon$-close to an $(\epsilon, t)$-mixing Markov chain, for
certain parameters of $\epsilon$. For example, given a strongly connected and dense enough Markov chain,
adding the edges of a constant-degree expander graph and choosing $t = O(\log n)$ yields a Markov
chain which $(\epsilon, t)$-mixes. However, for sparse Markov chains or small $\epsilon$, such a transformation
does not work. Furthermore, the situation changes when asking whether there is an $(\epsilon, t)$-mixing
Markov chain that is close both in the matrix representation and in the average $t$-step distribution:
specifically, it can be shown that there exist constants $\epsilon, \epsilon_1, \epsilon_2 < 1$ and a Markov chain M for which
no Markov chain is both $(\epsilon_1, \epsilon_2)$-close to M and $(\epsilon, \log n)$-mixing. In fact, when $\epsilon_1$ is small enough,
the problem becomes nontrivial even for $\epsilon_2 = 1$. The Markov chain corresponding to random walks
on the $n$-cycle provides an example which is not $(t^{-1/2}, 1)$-close to any $(\epsilon, t)$-mixing Markov chain.

Overview As before, our algorithm proceeds by taking random walks on the Markov chain and
comparing final distributions by using the $\ell_1$-Distance-Test. We define three types of states. First,
a normal state is one from which a random walk arrives at nearly the average $t$-step distribution.
In the discussion which follows, $t$ and $\epsilon$ denote constant parameters fixed as input to the algorithm.

Definition 30 Given a Markov chain M, a state $u$ of the chain is normal if it is $(\epsilon, t)$-close to
$s_{M,t}$, that is, if $\|e_u M^t - s_{M,t}\|_1 \le \epsilon$. A state is bad if it is not normal.

Testing normality requires time $O(t \cdot T(n, \epsilon, \delta))$. Using this definition, the first two algorithms
given in this section can be described as testing whether all (resp. most) states in M are normal.
Additionally, we need to distinguish states which not only produce random walks which arrive near
$s_{M,t}$ but which have low probability of visiting a bad state. We call such states smooth states.

Definition 31 A state $e_u$ in a Markov chain M is smooth if (a) $u$ is $(\epsilon, \tau)$-close to $s_{M,t}$ for
$\tau = t, \ldots, 2t$ and (b) the probability that a $2t$-step random walk starting at $e_u$ visits a bad state is
at most $\epsilon$.

Testing smoothness of a state requires $O(t^2 \cdot T(n, \epsilon, \delta))$ time. Our property test merely verifies by
random sampling that most states are smooth.

Footnote 2: We say $(x, y) \le (a, b)$ if $x \le a$ and $y \le b$.
4.3.2 The Test



We present below algorithm TestMixing in Figure 6, which on input Markov chain M and parameter
$\epsilon$ determines whether at least a $(1 - \epsilon)$ fraction of the states of M are smooth according to two
distributions: uniform and the average $t$-step distribution. Assuming access to $\ell_1$-Distance-Test
with complexity $T(n, \epsilon, \delta)$, this test runs in time $O(\epsilon^{-2} t^2\, T(n, \epsilon, \text{[illegible]}))$.

TestMixing(M, $t$, $\epsilon$)

1. Let $k = \Theta(1/\epsilon)$.

2. Choose $k$ states $u_1, \ldots, u_k$ uniformly at random.

3. Choose $k$ states $u_{k+1}, \ldots, u_{2k}$ independently according to $s_{M,t}$.

4. For $i = 1$ to $2k$

   (a) $u = e_{u_i}$.

   (b) For $w = 1$ to $O(1/\epsilon)$ and $j = 1$ to $2t$

       i. $u$ = next_node(M, $u$)

       ii. $\ell_1$-Distance-Test($e_u M^t$, $s_{M,t}$, $\epsilon$, [illegible])

   (c) For $\tau = t$ to $2t$, $\ell_1$-Distance-Test($e_{u_i} M^\tau$, $s_{M,t}$, $\epsilon$, [illegible])

5. Pass if all tests pass.

Figure 6: Algorithm TestMixing

The main lemma of this section says that any Markov chain that is accepted by our test is
$(2\epsilon, 1.01\epsilon)$-close to a $(4\epsilon, 2t)$-mixing Markov chain. First, we describe the modification of M that
we later show is $(4\epsilon, 2t)$-mixing.

Definition 32 $F$ is a function from $n \times n$ matrices to $n \times n$ matrices such that $F(M)$ returns $\tilde{M}$
by modifying the rows corresponding to bad states of M to $e_u$, where $u$ is any smooth state.
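The transformation of Definition 32 amounts to overwriting rows. The sketch below is illustrative: the sets of bad states and the chosen smooth state are supplied by the caller (deciding which states are bad or smooth is outside this snippet), states are 0-indexed, and the names are hypothetical.

```python
def apply_F(M, bad_states, smooth_state):
    """Return a copy of M in which every bad state's row is the point distribution e_u."""
    n = len(M)
    point = [1.0 if v == smooth_state else 0.0 for v in range(n)]
    return [list(point) if u in bad_states else list(M[u]) for u in range(n)]

def entries_changed(M, M_new):
    """Number of matrix entries on which the two chains differ."""
    return sum(a != b for row, new_row in zip(M, M_new) for a, b in zip(row, new_row))

M = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
M_new = apply_F(M, bad_states={1}, smooth_state=0)
print(M_new[1], entries_changed(M, M_new))
```

Each rewritten row changes at most $d + 1$ entries (the up to $d$ old nonzero entries plus the new entry equal to one), which is the per-row count used in the proof of Lemma 34.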

An important feature of the transformation F is that it does not affect the distribution of random 
walks originating from smooth states very much. 

Lemma 33 Given a Markov chain M and any state $u \in$ M which is smooth. If $\tilde{M} = F(M)$, then,
for any time $t \le \tau \le 2t$, $\|e_u M^\tau - e_u \tilde{M}^\tau\|_1 \le \epsilon$ and $\|s_{M,t} - e_u \tilde{M}^\tau\|_1 \le 2\epsilon$.

Proof: Define $\Gamma$ as the set of all walks of length $\tau$ from $u$ in M. Partition $\Gamma$ into $\Gamma_B$ and $\Gamma_{\bar{B}}$,
where $\Gamma_B$ is the subset of walks which visit a bad state. Let $\chi_{w,i}$ be an indicator function which
equals 1 if walk $w$ ends at state $i$, and 0 otherwise. Let weight function $W(w)$ be defined as the
probability that walk $w$ occurs. Finally, define the primed counterparts $\Gamma'$, etc. for the Markov
chain $\tilde{M}$. Now the $i$th element of $e_u M^\tau$ is $\sum_{w \in \Gamma_B} \chi_{w,i} \cdot W(w) + \sum_{w \in \Gamma_{\bar{B}}} \chi_{w,i} \cdot W(w)$. A similar
expression can be written for each element of $e_u \tilde{M}^\tau$. Since $W(w) = W'(w)$ whenever $w \in \Gamma_{\bar{B}}$, it
follows that

$$\|e_u M^\tau - e_u \tilde{M}^\tau\|_1 \le \sum_i \sum_{w \in \Gamma_B} \chi_{w,i} |W(w) - W'(w)| \le \sum_i \sum_{w \in \Gamma_B} \chi_{w,i} W(w) \le \epsilon.$$

Additionally, since $\|s_{M,t} - e_u M^\tau\|_1 \le \epsilon$ by the definition of smooth, it follows that
$\|s_{M,t} - e_u \tilde{M}^\tau\|_1 \le \|s_{M,t} - e_u M^\tau\|_1 + \|e_u M^\tau - e_u \tilde{M}^\tau\|_1 \le 2\epsilon$. □

We can now prove the main lemma. 
Lemma 34 If according to both the uniform distribution and the distribution $s_{M,t}$, a $(1 - \epsilon)$ fraction
of the states of a Markov chain M are smooth, then the matrix M is $(2\epsilon, 1.01\epsilon)$-close to a matrix
$\tilde{M}$ which is $(4\epsilon, 2t)$-mixing.

Proof: Let $\tilde{M} = F(M)$. $\tilde{M}$ and M differ on at most $\epsilon n(d + 1)$ entries. This gives the first
part of our distance bound. For the second we analyze $\|s_{M,t} - s_{\tilde{M},t}\|_1 \le \frac{1}{n} \sum_u \|e_u M^t - e_u \tilde{M}^t\|_1$
as follows. The sum is split into two parts, over the nodes which are smooth and those nodes
which are not. For each of the smooth nodes $u$, Lemma 33 says that $\|e_u M^t - e_u \tilde{M}^t\|_1 \le \epsilon$.
Nodes which are not smooth account for at most an $\epsilon$ fraction of the nodes in the sum, and thus can
contribute no more than $\epsilon$ absolute weight to the distribution $s_{\tilde{M},t}$. The sum can be bounded now
by $\|s_{M,t} - s_{\tilde{M},t}\|_1 \le \frac{1}{n}((1 - \epsilon)n\epsilon + \epsilon n) \le 2\epsilon$.

In order to show that $\tilde{M}$ is $(4\epsilon, 2t)$-mixing, we prove that for every state $u$, $\|s_{M,t} - e_u \tilde{M}^{2t}\|_1 \le
4\epsilon$. The proof considers three cases: $u$ smooth, $u$ bad, and $u$ normal. The last case is the most
involved.

If $u$ is smooth in the Markov chain M, then Lemma 33 immediately tells us that $\|s_{M,t} - e_u \tilde{M}^{2t}\|_1 \le
2\epsilon$. Similarly, if $u$ is bad in the Markov chain M, then in the chain $\tilde{M}$ any path starting at $u$
transitions to a smooth state $v$ in one step. Since $\|s_{M,t} - e_v \tilde{M}^{2t-1}\|_1 \le 2\epsilon$ by Lemma 33, the desired
bound follows.

If $e_u$ is a normal state which is not smooth, then we need a more involved analysis of the
distribution $e_u \tilde{M}^{2t}$. We divide $\Gamma$, the set of all $2t$-step walks in M starting at $u$, into three sets,
which we consider separately.

For the first set take $\Gamma_B \subseteq \Gamma$ to be the set of walks which visit a bad node before time $t$. Let
$d_b$ be the distribution over endpoints of these walks, that is, let $d_b$ assign to state $i$ the probability
that any walk $w \in \Gamma_B$ ends at state $i$. Let $w \in \Gamma_B$ be any such walk. If $w$ visits a bad state at
time $\tau < t$, then in the new Markov chain $\tilde{M}$, $w$ visits a smooth state $v$ at time $\tau + 1$. Another
application of Lemma 33 implies that $\|e_v \tilde{M}^{2t-\tau-1} - s_{M,t}\|_1 \le 2\epsilon$. Since this is true for all walks
$w \in \Gamma_B$, we find $\|d_b - s_{M,t}\|_1 \le 2\epsilon$.

For the second set, let $\Gamma_S \subseteq \Gamma \setminus \Gamma_B$ be the set of walks not in $\Gamma_B$ which visit a smooth state
at time $t$. Let $d_s$ be the distribution over endpoints of these walks. Any walk $w \in \Gamma_S$ is identical
in the chains M and $\tilde{M}$ up to time $t$, and then in the chain $\tilde{M}$ visits a smooth state $v$ at time $t$.
Thus, since $\|e_v \tilde{M}^t - s_{M,t}\|_1 \le 2\epsilon$, we have $\|d_s - s_{M,t}\|_1 \le 2\epsilon$.

Finally, let $\Gamma_N = \Gamma \setminus (\Gamma_B \cup \Gamma_S)$, and let $d_n$ be the distribution over endpoints of walks in $\Gamma_N$.
$\Gamma_N$ consists of a subset of the walks from a normal node $u$ which do not visit a smooth node at
time $t$. By the definition of normal, $u$ is $(\epsilon, t)$-close to $s_{M,t}$ in the Markov chain M. By assumption,
at most $\epsilon$ weight of $s_{M,t}$ is assigned to nodes which are not smooth. Therefore $|\Gamma_N|/|\Gamma|$ is at most
$\epsilon + \epsilon = 2\epsilon$.

Now define the weights of these distributions as $\omega_b$, $\omega_s$ and $\omega_n$. That is, $\omega_b$ is the probability that
a walk from $u$ in M visits a bad state before time $t$. Similarly, $\omega_s$ is the probability that a walk does
not visit a bad state before time $t$, but visits a smooth state at time $t$, and $\omega_n$ is the probability
that a walk does not visit a bad state but visits a normal, non-smooth state at time $t$. Then,
$\omega_b + \omega_s + \omega_n = 1$. Finally,

$$\|e_u \tilde{M}^{2t} - s_{M,t}\|_1 = \|\omega_b d_b + \omega_s d_s + \omega_n d_n - s_{M,t}\|_1 \le \omega_b \|d_b - s_{M,t}\|_1 + \omega_s \|d_s - s_{M,t}\|_1 + \omega_n \|d_n - s_{M,t}\|_1$$
$$\le (\omega_b + \omega_s) \max\{\|d_b - s_{M,t}\|_1, \|d_s - s_{M,t}\|_1\} + \omega_n \|d_n - s_{M,t}\|_1 \le 4\epsilon.$$ □

Given this, we finally can show our main theorem. 
Theorem 35 Let M be a Markov chain. Given $\ell_1$-Distance-Test with time complexity $T(n, \epsilon, \delta)$
and gap $f$ and an oracle for next_node, there exists a test such that if M is $(f(\epsilon), t)$-mixing
then the test accepts with probability at least 2/3. If M is not $(2\epsilon, 1.01\epsilon)$-close to any $\tilde{M}$ which
is $(4\epsilon, 2t)$-mixing then the test rejects with probability at least 2/3. The runtime of the test is
$O(\frac{1}{\epsilon^2} \cdot t^2 \cdot T(n, \epsilon, \text{[illegible]}))$.

Proof: Since in any Markov chain $M$ which is $(\epsilon, t)$-mixing all states are smooth, $M$ passes
this test with probability at least $(1 - \delta)$. Furthermore, any Markov chain with at least $(1 - \epsilon)$ fraction
of smooth states is $(2\epsilon, 1.01\epsilon)$-close to a Markov chain which is $(4\epsilon, 2t)$-mixing, by Lemma 34. □
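
The overall shape of such a test can be sketched as follows. This is only an illustration under stated assumptions: $s_{M,t}$ is taken to be the $t$-step distribution from a uniformly random start state, `naive_l1_test` is a hypothetical plug-in stand-in for the $L_1$-Distance-Test (it is not sublinear and not the paper's tester), and the parameters `k`, `m` and the threshold are illustrative choices rather than the values used in the analysis.

```python
import random
from collections import Counter

def walk_endpoint(next_node, u, t, rng):
    """Endpoint of a t-step walk from u, using the next_node oracle."""
    for _ in range(t):
        u = next_node(u, rng)
    return u

def naive_l1_test(sample_p, sample_q, eps, m):
    """Hypothetical stand-in for L1-Distance-Test: compares empirical
    distributions built from m samples each (not sublinear)."""
    cp = Counter(sample_p() for _ in range(m))
    cq = Counter(sample_q() for _ in range(m))
    est = sum(abs(cp[x] - cq[x]) for x in set(cp) | set(cq)) / m
    return est <= eps / 2  # illustrative threshold

def mixing_test(next_node, n, t, eps, k, m, rng):
    """Accept iff k random start states all look close to the average
    t-step distribution (assumed here to be s_{M,t})."""
    def from_average():
        return walk_endpoint(next_node, rng.randrange(n), t, rng)
    for _ in range(k):
        u = rng.randrange(n)
        def from_u(u=u):
            return walk_endpoint(next_node, u, t, rng)
        if not naive_l1_test(from_u, from_average, eps, m):
            return False
    return True

rng = random.Random(0)
n = 8
uniform_jump = lambda u, r: r.randrange(n)  # toy chain that mixes in one step
frozen = lambda u, r: u                     # toy chain that never moves
accepts_uniform = mixing_test(uniform_jump, n, t=3, eps=0.5, k=4, m=2000, rng=rng)
accepts_frozen = mixing_test(frozen, n, t=3, eps=0.5, k=4, m=2000, rng=rng)
```

The two toy chains are the extreme cases: the first is accepted because every start state already matches the average distribution, and the second is rejected because each walk stays at its start state.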

4.4 Extension to Sparse Graphs and Uniform Distributions 

The property test can also be made to work for general sparse Markov chains by a simple modification
to the testing algorithms. Consider Markov chains with at most $m \ll n^2$ nonzero entries,
but with no nontrivial bound on the number of nonzero entries per row. Then, the definition of
the distance should be modified to $\Delta(M_1, M_2) = (\epsilon_1, \epsilon_2)$ if $M_1$ and $M_2$ differ on $\epsilon_1 \cdot m$ entries and
$\|s_{M_1,t} - s_{M_2,t}\|_1 = \epsilon_2$. The above test does not suffice for testing that $M$ is $(\epsilon_1, \epsilon_2)$-close to
an $(\epsilon, t)$-mixing Markov chain $\tilde{M}$, since in our proof, the rows corresponding to bad states may
have many nonzero entries and thus $M$ and $\tilde{M}$ may differ in a large fraction of the nonzero entries.
However, let $D$ be a distribution on states in which the probability of each state is proportional to
the cardinality of the support set of its row. Natural ways of encoding this Markov chain allow
constant time generation of states according to $D$. By modifying the algorithm to also test whether
most states according to $D$ are smooth, one can show that $M$ is close to an $(\epsilon, t)$-mixing Markov
chain $\tilde{M}$.
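
One encoding that permits constant-time sampling from $D$ is a flat array holding one record per nonzero entry: picking a record uniformly at random and returning its row yields each state with probability proportional to the support size of its row. The sketch below assumes such an encoding; the `entries` layout, the function name and the toy chain are hypothetical.

```python
import random
from collections import Counter

# Hypothetical encoding: one (row, column, probability) triple per nonzero entry.
entries = [
    (0, 0, 0.5), (0, 1, 0.25), (0, 2, 0.25),  # row 0 has support size 3
    (1, 0, 1.0),                              # row 1 has support size 1
    (2, 2, 1.0),                              # row 2 has support size 1
]

def sample_state_by_support(entries, rng):
    """Pick a nonzero entry uniformly at random and return its row, so a
    state is drawn with probability proportional to its row's support size."""
    row, _col, _prob = entries[rng.randrange(len(entries))]
    return row

rng = random.Random(1)
draws = Counter(sample_state_by_support(entries, rng) for _ in range(5000))
freq = {state: draws[state] / 5000 for state in range(3)}
```

For this toy chain the target distribution $D$ is $(3/5, 1/5, 1/5)$, and the empirical frequencies in `freq` should land close to those values.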

Because of our ability to test $\epsilon$-closeness to the uniform distribution in $O(n^{1/2} \epsilon^{-2})$ steps [32], it
is possible to speed up our test for mixing for those Markov chains known to have uniform stationary 
distribution, such as Markov chains corresponding to random walks on regular graphs. An ergodic
random walk on the vertices of an undirected graph may instead be regarded (by looking at it "at
times t + 1/2") as a random walk on the edge-midpoints of that graph. The stationary distribution 
on edge-midpoints always exists and is uniform. So, for undirected graphs we can speed up mixing 
testing by using a tester for closeness to the uniform distribution. 
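
The uniformity claim can be checked exactly on a small example. The sketch below uses an arbitrary non-regular graph (an assumption chosen for illustration) with $m$ edges: the vertex walk has stationary distribution $\deg(v)/2m$, and the probability that a stationary walker is crossing a given edge, in either direction, comes out to $1/m$ for every edge.

```python
from fractions import Fraction as F

# Small connected, non-regular, non-bipartite graph (illustrative choice).
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
n, m = 4, len(edges)
nbrs = {v: [] for v in range(n)}
for u, v in edges:
    nbrs[u].append(v)
    nbrs[v].append(u)

# Candidate stationary distribution of the vertex walk: deg(v) / 2m.
pi = {v: F(len(nbrs[v]), 2 * m) for v in range(n)}

# One step of the walk applied to pi; stationarity means pi_next == pi.
pi_next = {v: sum(pi[u] / len(nbrs[u]) for u in nbrs[v]) for v in range(n)}

# Probability of being on the midpoint of edge {u, v} at a half-integer time.
midpoint = {(u, v): pi[u] / len(nbrs[u]) + pi[v] / len(nbrs[v]) for u, v in edges}
```

Even though vertex 0 carries three times the stationary mass of vertex 3, every entry of `midpoint` equals $1/4$, which is why a uniformity tester can be applied to the edge-midpoint view.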

A Chebyshev's Inequality

Chebyshev's inequality states that for any random variable $A$, and $\rho > 0$,

$$\Pr[|A - E[A]| \ge \rho] \le \frac{\mathrm{Var}(A)}{\rho^2}.$$
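
As a small worked instance (an illustration, not part of the paper's argument), the bound can be evaluated exactly for a fair six-sided die with deviation threshold 2; the names below are hypothetical.

```python
from fractions import Fraction as F

values = [1, 2, 3, 4, 5, 6]  # fair die, each value with probability 1/6
mean = F(sum(values), 6)
var = sum((v - mean) ** 2 for v in values) / 6

def tail(rho):
    """Exact Pr[|A - E[A]| >= rho] for the fair die."""
    return F(sum(1 for v in values if abs(v - mean) >= rho), 6)

rho = 2
lhs, rhs = tail(rho), var / rho ** 2
```

Here the mean is $7/2$ and the variance is $35/12$, so the tail probability $1/3$ (outcomes 1 and 6) sits below the Chebyshev bound $35/48$.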